Abstract
Vision-and-Language Navigation (VLN) is an emerging and fast-developing research topic in which an embodied agent must navigate a real-world environment by following natural language instructions. In this article, we present the Direction-guided Navigator Agent (DNA), which integrates direction clues derived from the instructions into the standard encoder-decoder navigation framework. In particular, DNA couples the standard instruction encoder with an additional direction branch that sequentially encodes the direction clues in the instructions to boost navigation. Furthermore, an Instruction Flipping mechanism is devised to enable fast data augmentation, together with a follow-up backtracing procedure that navigates the agent in the backward direction. This design grounds the instruction in the local visual scenes along both the forward and backward directions, and thus strengthens the alignment between the instruction and the action sequence. Extensive experiments on the Room-to-Room (R2R) dataset validate our proposal and demonstrate quantitatively compelling results.
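The abstract only outlines the architecture, so below is a minimal, hypothetical PyTorch sketch of the two ideas it names: a direction branch that re-encodes direction-clue words alongside the full instruction, and an instruction-flipping routine for backward augmentation. All names, dimensions, the clue vocabulary, and the opposite-word swap are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only -- not the DNA authors' code. It assumes a word-level
# LSTM instruction encoder coupled with a parallel "direction branch" that
# sequentially encodes only the direction-clue tokens, as the abstract describes.
import torch
import torch.nn as nn

# Hypothetical set of direction-clue words used to pick out dir_ids.
DIRECTION_WORDS = {"left", "right", "forward", "back", "up", "down", "around"}


class DirectionGuidedEncoder(nn.Module):
    """Couples a standard instruction encoder with a direction-clue branch."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.instr_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dir_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, instr_ids, dir_ids):
        # instr_ids: (B, L) full instruction tokens
        # dir_ids:   (B, K) direction-clue tokens only, zero-padded
        instr_ctx, _ = self.instr_lstm(self.embed(instr_ids))
        dir_ctx, _ = self.dir_lstm(self.embed(dir_ids))
        # Both context sequences would be attended over by the action decoder.
        return instr_ctx, dir_ctx


def flip_instruction(dir_ids, flip_map):
    """Instruction flipping for augmentation (assumed form): reverse the clue
    order and swap each clue with its opposite (e.g. left <-> right), so the
    flipped instruction roughly describes the same path traversed backward."""
    return [flip_map.get(t, t) for t in reversed(dir_ids)]
```

In a full agent, both context sequences would presumably feed an attention-based action decoder, and a flipped instruction would be paired with the reversed trajectory to form an extra training example for the backtracing step; the exact pairing used by DNA is not specified in the abstract.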