Title: Towards Embodied Navigation through Vision and Language
Abstract: Embodied AI seeks to create autonomous systems that assist humans in completing challenging tasks. Embodiment denotes an agent's ability to understand and interact with its environment based on informed reasoning. Their potential applications in service and industrial robotics make such agents both relevant and impactful.
These agents must perceive the real world through various sensors, reason about possible actions, interact and communicate with humans to clarify the task, and act on their understanding. Because embodied AI is a broad and challenging problem, we focus on a sub-problem called vision-and-language navigation (VLN), in which an agent follows natural language instructions to navigate through an indoor environment. Here, vision serves as an important sensory input for perceiving and reasoning about the environment. Visual semantics such as object category, location, object-object relationships, object-place relationships, and environment geometry can be used to ground, or connect to, natural language instructions. The objective of this project is to investigate methods that improve vision-and-language navigation by leveraging the semantics provided by visual models and grounding them efficiently in natural language instructions. This work aims to develop methods that extract meaningful information from language and vision to enable navigation in unseen environments.