
Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi

Published: 24 March 2023

Abstract

Encoder-decoder architectures are the state of the art for sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering. In image captioning, the encoder, typically a convolutional neural network (CNN), encodes the input image into a fixed-dimensional vector representation, while the decoder, typically a recurrent neural network, performs language modeling and generates the target description. Conventional CNNs apply the same operation to every pixel, even though not all pixels of an image are equally important. To address this, the proposed method uses a dynamic convolution-based encoder for feature extraction, a long short-term memory (LSTM) network as the decoder for language modeling, and X-Linear attention to make the system robust. Because the encoder, the attention mechanism, and the decoder are all central to image captioning, we experiment with several variants of each. Most existing work on image captioning targets English; we instead propose a novel approach for generating captions from images in Hindi. Hindi, widely spoken in India and South Asia, is the fourth most-spoken language globally and an official language of India. The proposed method applies dynamic convolution on the encoder side to obtain higher-quality image encodings. The Hindi image captioning dataset was created manually by translating the popular MSCOCO dataset from English to Hindi. In terms of BLEU scores, the proposed method outperforms several baselines, and manual human assessment of the adequacy and fluency of the generated captions further confirms that it produces good-quality captions.
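To make the encoder idea concrete, the sketch below implements one common formulation of dynamic convolution: a lightweight gating network predicts per-example attention weights over a bank of K candidate kernels, and each input is convolved with its own weighted mixture of those kernels. This is a minimal PyTorch illustration under assumed hyperparameters (the number of kernels, the reduction ratio, and the gate design are not taken from the paper), not the authors' exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Minimal sketch of dynamic convolution (attention over kernels).
    K, the reduction ratio, and the gate design are illustrative
    assumptions, not values from the paper."""

    def __init__(self, in_ch, out_ch, kernel_size, num_kernels=4, reduction=4):
        super().__init__()
        self.out_ch = out_ch
        self.padding = kernel_size // 2
        # Bank of K candidate kernels, shape (K, out_ch, in_ch, k, k).
        self.weight = nn.Parameter(
            0.02 * torch.randn(num_kernels, out_ch, in_ch,
                               kernel_size, kernel_size))
        # Squeeze-and-excitation-style gate: global pool -> MLP -> K logits.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, num_kernels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        attn = F.softmax(self.gate(x), dim=1)  # (B, K) kernel attention
        # Mix the K kernels separately for every example in the batch.
        mixed = torch.einsum('bk,koihw->boihw', attn, self.weight)
        mixed = mixed.reshape(b * self.out_ch, c, *self.weight.shape[-2:])
        # A grouped convolution lets each example use its own mixed kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w), mixed,
                       padding=self.padding, groups=b)
        return out.reshape(b, self.out_ch, out.shape[-2], out.shape[-1])

# Example: a dynamic 3x3 layer mapping 64 -> 128 channels.
layer = DynamicConv2d(64, 128, kernel_size=3)
features = layer(torch.randn(2, 64, 32, 32))  # -> (2, 128, 32, 32)
```

In the full pipeline described by the abstract, feature maps from such encoder layers would be attended over by X-Linear attention and decoded by the LSTM into a Hindi caption.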



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4 (April 2023), 682 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3588902


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 24 March 2023
• Online AM: 26 December 2022
• Accepted: 12 November 2022
• Revised: 28 June 2022
• Received: 10 December 2021

Published in TALLIP Volume 22, Issue 4


      Qualifiers

      • research-article
