Abstract
Bottom-up and top-down attention mechanisms have revolutionized image captioning by enabling object-level attention for multi-step reasoning over all detected objects. However, when humans describe an image, they draw on their own subjective experience to focus on only a few salient objects worth mentioning, rather than on every object in the image. These focused objects are further arranged in linguistic order, yielding the "object sequence of interest" that composes an enriched description. In this work, we present the Bottom-up and Top-down Object inference Network (BTO-Net), which exploits the object sequence of interest as a top-down signal to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learned to produce the object sequence of interest, which acts as a top-down prior mimicking the subjective experience of humans. Next, the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent a cacophony of intermixed cross-modal signals, a contrastive learning-based objective is introduced to restrict the interaction between the bottom-up and top-down signals, leading to reliable and explainable cross-modal reasoning. Our BTO-Net achieves competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net.
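To make the abstract's three ingredients concrete, below is a minimal PyTorch sketch of (1) an LSTM-based object inference module that rolls out a top-down "object sequence of interest" from the detected bottom-up region features, (2) an attention step that fuses both signal streams during decoding, and (3) an InfoNCE-style contrastive term that ties the inferred top-down sequence to its own image's bottom-up features. All module names, feature sizes, the single-layer dot-product attention, and the mean-pooling choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectInference(nn.Module):
    """Top-down prior: infer a short sequence of 'objects of interest'
    from the full set of detected (bottom-up) region features."""
    def __init__(self, d=1024, steps=6):
        super().__init__()
        self.steps = steps
        self.lstm = nn.LSTMCell(d, d)
        self.att = nn.Linear(d, d, bias=False)

    def forward(self, regions):                     # regions: (B, N, d)
        h = regions.mean(1)                         # init from pooled regions
        c = torch.zeros_like(h)
        x, seq = h, []
        for _ in range(self.steps):
            h, c = self.lstm(x, (h, c))
            # attend over all detected objects to pick the next one of interest
            logits = (self.att(regions) @ h.unsqueeze(-1)).squeeze(-1)  # (B, N)
            w = logits.softmax(-1)
            x = (w.unsqueeze(-1) * regions).sum(1)  # soft-selected object feature
            seq.append(x)
        return torch.stack(seq, 1)                  # (B, steps, d)

def fuse(h, bottom_up, top_down, att_bu, att_td):
    """One decoding step: attend separately over the bottom-up and
    top-down signals with the current decoder state h, then
    concatenate the two context vectors."""
    def attend(feats, proj):
        w = (proj(feats) @ h.unsqueeze(-1)).squeeze(-1).softmax(-1)
        return (w.unsqueeze(-1) * feats).sum(1)
    return torch.cat([attend(bottom_up, att_bu), attend(top_down, att_td)], -1)

def info_nce(top_down, bottom_up, tau=0.07):
    """Contrastive restriction: the pooled top-down sequence should match
    its own image's pooled bottom-up features, not those of other images
    in the batch."""
    q = F.normalize(top_down.mean(1), dim=-1)       # (B, d)
    k = F.normalize(bottom_up.mean(1), dim=-1)      # (B, d)
    logits = q @ k.t() / tau                        # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

On one plausible reading of the abstract, the contrastive term would be added to the usual captioning loss during training, so that the inferred object sequence of interest stays grounded in its own image's detections while still acting as a top-down prior for the attention-based decoder.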