OVL-MAP: An Online Visual Language Map Approach for Vision-and-Language Navigation in Continuous Environments

Shuhuan Wen*, Ziyuan Zhang, Yuxiang Sun, Zhiwen Wang

*Corresponding author for this work

Research output: Journal Publications and Reviews · RGC 21 - Publication in refereed journal · peer-review

Abstract

Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate 3D environments based on visual observations and natural language instructions. Existing approaches, which rely on topological and semantic maps, often struggle to accurately understand and adapt to complex or previously unseen environments, largely because their maps are constructed statically and offline. To address these challenges, this paper proposes OVL-MAP, an innovative algorithm comprising three key modules: an online vision-and-language map construction module, a waypoint prediction module, and an action decision module. The online map construction module leverages robust open-vocabulary semantic segmentation to dynamically enhance the agent's scene understanding. The waypoint prediction module processes natural language instructions to identify task-relevant regions, predict sub-goal locations, and guide trajectory planning. The action decision module employs the DD-PPO strategy for effective navigation. Evaluations on the Robo-VLN and R2R-CE datasets demonstrate that OVL-MAP significantly improves navigation performance and generalizes better to unseen environments. © 2025 IEEE.
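As a rough illustration of the pipeline described in the abstract, the sketch below shows how an online map of open-vocabulary features and an instruction-conditioned waypoint predictor might fit together. All names (OnlineVLMap, predict_waypoint), the grid and feature dimensions, and the cosine-similarity scoring are illustrative assumptions rather than the paper's actual implementation; the DD-PPO action decision module is omitted.

```python
import numpy as np

class OnlineVLMap:
    """Hypothetical online map that accumulates per-cell open-vocabulary features."""

    def __init__(self, grid_size=100, feat_dim=512):
        self.features = np.zeros((grid_size, grid_size, feat_dim), dtype=np.float32)
        self.counts = np.zeros((grid_size, grid_size), dtype=np.int32)

    def update(self, cells, feats):
        # Running average of segmentation features projected onto map cells,
        # so the map is refined online as new observations arrive.
        for (r, c), f in zip(cells, feats):
            n = self.counts[r, c]
            self.features[r, c] = (self.features[r, c] * n + f) / (n + 1)
            self.counts[r, c] += 1

    def relevance(self, text_feat):
        # Cosine similarity between each cell's feature and the instruction embedding.
        flat = self.features.reshape(-1, self.features.shape[-1])
        denom = np.linalg.norm(flat, axis=1) * np.linalg.norm(text_feat) + 1e-8
        return (flat @ text_feat / denom).reshape(self.counts.shape)


def predict_waypoint(vl_map, text_feat):
    """Pick the map cell most relevant to the instruction as the next sub-goal."""
    score = vl_map.relevance(text_feat)
    return np.unravel_index(np.argmax(score), score.shape)


# Toy usage: one map update followed by sub-goal selection.
rng = np.random.default_rng(0)
vl_map = OnlineVLMap(grid_size=10, feat_dim=8)
cells = [(2, 3), (2, 4)]
feats = rng.normal(size=(2, 8)).astype(np.float32)
vl_map.update(cells, feats)
instruction_feat = feats[0]  # stand-in for a CLIP-style text embedding of the instruction
print(predict_waypoint(vl_map, instruction_feat))  # -> cell (2, 3)
```

In the actual method, the predicted sub-goal would then be handed to the action decision module, which the paper implements with DD-PPO.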
Original language: English
Pages (from-to): 3294-3301
Journal: IEEE Robotics and Automation Letters
Volume: 10
Issue number: 4
Online published: 11 Feb 2025
Publication status: Published - Apr 2025

Research Keywords

  • embodied intelligence
  • multimodal perception
  • navigation maps
  • vision-based navigation
