
An Object Localization-based Dense Image Captioning Framework in Hindi

Published: 27 December 2022

Abstract

Dense image captioning is the task of generating localized natural-language captions for multiple regions of an image. It draws on both computer vision, for recognizing regions in an image, and natural language processing, for generating the corresponding captions. Numerous works have addressed dense image captioning for resource-rich languages such as English; resource-poor languages such as Hindi, however, remain largely unexplored. Hindi is one of India’s official languages and the third most spoken language in the world. This article proposes a dense image captioning model that describes different segments of an image by generating more than one caption in the Hindi language. For localized region recognition and language modeling, we employ Faster R-CNN and Long Short-Term Memory (LSTM), respectively. In addition, we conduct experiments with gated recurrent units (GRUs) and an attention mechanism. A dataset for dense image captioning in Hindi has been created by manually translating the well-known Visual Genome dataset from English to Hindi. Experiments on this newly constructed dataset demonstrate the efficacy of the proposed method over state-of-the-art methods.
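The abstract outlines a two-stage design: Faster R-CNN proposes and localizes image regions, and a recurrent decoder (LSTM or GRU, optionally with attention) generates a Hindi caption for each region. Below is a minimal PyTorch sketch of such a pipeline under stated assumptions: torchvision's off-the-shelf Faster R-CNN stands in for the region detector, a ResNet-50 crop encoder stands in for the paper's region features, and RegionLSTMDecoder, dense_caption, the special-token ids, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: detector, crop encoder, token ids, and
# hyperparameters are assumptions, not the paper's released model.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import resized_crop

BOS_ID, EOS_ID = 1, 2  # assumed special-token ids in the Hindi vocabulary


class RegionLSTMDecoder(nn.Module):
    """Conditions an LSTM on one region feature and emits caption tokens."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)     # region feature -> h0
        self.init_c = nn.Linear(feat_dim, hidden_dim)     # region feature -> c0
        self.embed = nn.Embedding(vocab_size, embed_dim)  # Hindi token embeddings
        # The GRU variant from the abstract would use nn.GRUCell here
        # (and drop the cell state c below).
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, feat, max_len=15):
        # feat: (1, feat_dim) pooled feature of one detected region.
        h, c = self.init_h(feat), self.init_c(feat)
        tok, ids = torch.tensor([BOS_ID]), []
        for _ in range(max_len):
            h, c = self.cell(self.embed(tok), (h, c))
            tok = self.proj(h).argmax(dim=-1)
            if tok.item() == EOS_ID:
                break
            ids.append(tok.item())
        return ids  # map back to Hindi words with the dataset vocabulary


@torch.no_grad()
def dense_caption(image, decoder, top_k=5, score_thresh=0.7):
    """Detect regions with Faster R-CNN, then caption each region crop.

    `image` is a CHW float tensor in [0, 1], as torchvision detectors expect.
    ImageNet normalization of the crops is omitted here for brevity.
    """
    detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
    encoder.fc = nn.Identity()  # keep the 2048-d pooled feature per crop
    encoder.eval()

    det = detector([image])[0]  # boxes come back sorted by confidence
    results = []
    for box, score in zip(det["boxes"][:top_k], det["scores"][:top_k]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = box.int().tolist()
        crop = resized_crop(image, y1, x1, y2 - y1, x2 - x1, [224, 224])
        feat = encoder(crop.unsqueeze(0))  # (1, 2048)
        results.append((box.tolist(), decoder.greedy_decode(feat)))
    return results  # list of (bounding box, Hindi caption token ids)
```

Training such a model would minimize cross-entropy between each decoder step and the ground-truth Hindi caption of the corresponding annotated Visual Genome region; the attention variant mentioned in the abstract would attend over spatial feature maps at each step rather than conditioning on a single pooled vector.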



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
  February 2023
  624 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3572719

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 December 2022
      • Online AM: 22 August 2022
      • Accepted: 16 May 2022
      • Revised: 26 March 2022
      • Received: 21 November 2021
Published in TALLIP Volume 22, Issue 2


      Qualifiers

      • research-article
      • Refereed
