research-article

Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing

Published: 05 January 2023

Abstract

Vision-and-Language Navigation (VLN) is an emerging and fast-developing research topic in which an embodied agent is required to navigate a real-world environment following natural language instructions. In this article, we present a Direction-guided Navigator Agent (DNA) that integrates direction clues derived from the instructions into the essential encoder-decoder navigation framework. In particular, DNA couples the standard instruction encoder with an additional direction branch that sequentially encodes the direction clues in the instructions to boost navigation. Furthermore, an Instruction Flipping mechanism is devised to enable fast data augmentation as well as follow-up backtracing, which navigates the agent in the backward direction. This design naturally amplifies the grounding of instructions in local visual scenes along both the forward and backward directions, and thus strengthens the alignment between instructions and action sequences. Extensive experiments conducted on the Room-to-Room (R2R) dataset validate our proposal and demonstrate quantitatively compelling results.
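The abstract describes the Instruction Flipping mechanism only at a high level. Purely as an illustrative sketch (the function name, swap table, and clause segmentation below are our own assumptions, not the authors' implementation), flipping an instruction can be pictured as reversing its clause order and swapping direction words so that the result roughly describes the same path walked backward:

```python
# Hypothetical sketch of an "instruction flipping" step for backward navigation.
# The swap rules and segmentation here are illustrative assumptions only.

# Direction words and their opposites when the path is traversed in reverse.
OPPOSITE = {
    "left": "right", "right": "left",
    "up": "down", "down": "up",
    "forward": "back", "back": "forward",
}

def flip_instruction(instruction: str) -> str:
    """Reverse the clause order and swap direction words so the flipped
    instruction roughly describes the same path walked backward."""
    # Crude clause segmentation on commas and periods.
    clauses = [c.strip() for c in instruction.replace(".", ",").split(",") if c.strip()]
    flipped_clauses = []
    for clause in reversed(clauses):  # visit the path segments in reverse order
        words = [OPPOSITE.get(w.lower(), w) for w in clause.split()]
        flipped_clauses.append(" ".join(words))
    return ", ".join(flipped_clauses) + "."

print(flip_instruction("Walk forward past the kitchen, turn left at the sofa, stop at the door"))
# -> "stop at the door, turn right at the sofa, Walk back past the kitchen."
```

In the setting the abstract describes, such flipped instructions would serve both as additional training data and as the input for the backward (backtracing) pass.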



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1, January 2023, 505 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572858
  • Editor: Abdulmotaleb El Saddik

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 January 2023
        • Online AM: 17 March 2022
        • Accepted: 24 January 2022
        • Revised: 6 July 2021
        • Received: 18 October 2020
Published in TOMM Volume 19, Issue 1

        Qualifiers

        • research-article
        • Refereed