1 Visual Navigation Yukun Cui

2 Outline
Introduction
Technical Approaches: Target-driven Visual Navigation in Indoor Scenes; Vision-and-Language Navigation
Summary and Discussion

3 Introduction Navigation: How can I get to HUB Heritage Hall to check in?

4 Introduction Outdoor Navigation Based on Map and GPS

5 Introduction Indoor Navigation: How can I find Heritage Hall inside the HUB?
1. Google Maps has no detailed information about the inside of the building
2. GPS is not accurate inside the building

6 Visual Navigation Visual navigation can be roughly described as the process of using visual observations to determine a suitable and safe path between a starting point and a goal point for a robot or user travelling between them.

7 Visual Navigation Map-Based Navigation
Use a global map of the environment to make navigation decisions. The main problem is self-localization:
1. Acquire image information
2. Detect landmarks in the current view (edges, corners, objects)
3. Match the observed landmarks with those contained in the stored map according to certain criteria
4. Update the robot position as a function of the matched landmarks' locations in the map
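To make step 4 concrete, here is a minimal, hypothetical sketch (not taken from either paper discussed later): assuming the robot's heading is known and landmark matching has already been done, each matched landmark votes for the robot position and the votes are averaged.

```python
import numpy as np

def localize(matches):
    """matches: list of (observed_rel_xy, map_global_xy) pairs for landmarks matched
    between the current view and the stored map. Assuming the robot's heading is known
    (observations expressed in the map frame's orientation), each match votes for the
    robot position as map position minus observed offset; the votes are averaged."""
    votes = [np.asarray(map_xy) - np.asarray(obs_xy) for obs_xy, map_xy in matches]
    return np.mean(votes, axis=0)

# Toy example: two landmarks observed 2 m ahead and 1 m to the left of the robot
matches = [((2.0, 0.0), (5.0, 3.0)),   # landmark A
           ((0.0, 1.0), (3.1, 4.0))]   # landmark B
print(localize(matches))               # estimated robot position, roughly [3.05 3.0]
```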

8 Visual Navigation Map-Building-Based Navigation
Construct a map of the environment and then use it for navigation
The map can be built through a training phase guided by humans, or by the robot itself via SLAM
Example: navigation and positioning by InMotion

9 Visual Navigation Map-less Navigation
Movements depend on the elements observed in the environment
Vision-and-Language Navigation (VLN): link natural language to vision and action in unstructured, previously unseen environments for navigation

10 Target-driven Visual Navigation
Input: the current observation and an image of the target
Output: an action in the 3D environment
Advantages: generalizes to different scenes and targets; trained in a virtual environment and usable in real scenes

11 Target-driven Visual Navigation
AI2-THOR Framework
Integrates the model with different types of environments
Plug-and-play architecture, so different types of scenes can be easily incorporated
A detailed model of the physics of the scene
Highly scalable
Cheaper and safer training

12 Target-driven Visual Navigation
The AI2-THOR framework is designed by integrating a physics engine (Unity 3D) with a deep learning framework (TensorFlow), so feedback from the environment can be used immediately for online decision making. To generalize to real-world images, the scenes mimic the appearance of the real world as closely as possible.
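For reference, a minimal interaction loop with the publicly released ai2thor Python package looks roughly like this; constructor arguments and action names have changed across releases, so treat it as a sketch rather than the exact setup used in the paper.

```python
from ai2thor.controller import Controller

# Start a kitchen scene; Unity runs the physics, Python receives the observations.
controller = Controller(scene="FloorPlan1")

for _ in range(4):
    event = controller.step(action="MoveAhead")      # one physics-backed step
    rgb = event.frame                                # first-person RGB frame (numpy array)
    ok = event.metadata["lastActionSuccess"]         # immediate feedback for online decisions
    print(rgb.shape, ok)

controller.step(action="RotateRight")
controller.stop()
```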

13 Target-driven Visual Navigation
32 scenes in 4 common scene types: kitchen, living room, bedroom, and bathroom, with about 68 object instances per scene. The framework can also be used for more fine-grained physical interactions.

14 Target-driven Visual Navigation
The AI2-THOR framework vs. other simulation frameworks: AI2-THOR is closer to the real world.

15 Target-driven Visual Navigation
Target-driven Navigation Model
Deep reinforcement learning (DRL) models provide an end-to-end learning framework for transforming pixel information into actions. Standard DRL models aim at finding a direct mapping from state representations s to the policy π(s), so the goal is hard-coded in the network parameters and any change of goal requires updating those parameters accordingly.

16 Target-driven Visual Navigation
Problem: lack of generalization. When incorporating new targets, a new model has to be re-trained.
Solution: the task objective is specified as an input to the model instead of implanting the target in the model parameters. The action a at time t is drawn from a ∼ π(s_t, g | u), so actions are conditioned on both states and targets, and no re-training is required for new targets.

17 Target-driven Visual Navigation
Target-driven Navigation Model
Action space: four actions (moving forward, moving backward, turning left, turning right), with a constant step length (0.5 meters) and turning angle (90 degrees)
Observations and goals: both are images taken by the agent's RGB camera in its first-person view; the task objective is to navigate to the location and viewpoint where the target image was taken
Reward design: minimize the trajectory length to the navigation target, using a goal-reaching reward (10.0) and a small time penalty (-0.01) as the immediate reward
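A direct transcription of this reward design into Python, with the constants taken from the slide (the goal test itself is left as an argument):

```python
GOAL_REWARD = 10.0     # given once the agent reaches the target location and viewpoint
TIME_PENALTY = -0.01   # small per-step penalty that encourages short trajectories

def step_reward(reached_goal: bool) -> float:
    """Immediate reward for one action under the reward design above."""
    return GOAL_REWARD if reached_goal else TIME_PENALTY
```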

18 Target-driven Visual Navigation
Targets across all scenes share the same generic siamese layers Targets within a scene share the same scene-specific layer

19 Target-driven Visual Navigation
Target-driven Navigation Model
A3C: a reinforcement learning model that learns by running multiple copies of training threads in parallel and updates a shared set of model parameters asynchronously
Each thread runs with a different navigation target
Scene-specific layers: updated by gradients from the navigation tasks within that scene
Generic siamese layers: updated by all targets
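Continuing the TargetDrivenNet sketch above (and assuming it is in scope), here is a quick check that a loss computed for one scene produces gradients only in the shared layers and that scene's head, which is the per-thread update pattern described on this slide:

```python
import torch

model = TargetDrivenNet(scene_ids=("kitchen", "bathroom"))
obs, target = torch.randn(1, 2048), torch.randn(1, 2048)

logits, value = model(obs, target, scene_id="kitchen")
(logits.sum() + value.sum()).backward()             # stand-in for the A3C loss

print(model.siamese[0].weight.grad is not None)                        # True: shared layers
print(model.scene_heads["kitchen"]["policy"].weight.grad is not None)  # True: this scene's head
print(model.scene_heads["bathroom"]["policy"].weight.grad)             # None: other scenes untouched
```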

20 Target-driven Visual Navigation
Target-driven Navigation Model
Implemented in TensorFlow and trained on an Nvidia GeForce GTX Titan X GPU
The target-driven model learns better navigation policies than the state-of-the-art A3C methods after 100M training frames

21 Target-driven Visual Navigation
Generalizing to new targets within one scene
All models are trained with 20M frames
Success rate: the percentage of trajectories shorter than 500 steps
There is a consistent trend of increasing success rate as the number of trained targets increases
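The success-rate metric on this slide, written out (the threshold and the example lengths are illustrative):

```python
def success_rate(trajectory_lengths, max_steps=500):
    """Fraction of evaluation episodes whose trajectory is shorter than max_steps."""
    successes = sum(1 for n in trajectory_lengths if n < max_steps)
    return successes / len(trajectory_lengths)

print(success_rate([120, 480, 500, 731]))   # 0.5
```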

22 Target-driven Visual Navigation
Generalizing to new scenes
Convergence is faster as the number of trained scenes grows
Transferring the generic layers improves data efficiency for learning in new environments, compared to training from scratch

23 Target-driven Visual Navigation

24 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments Existing vision and language methods (e.g. VQA) can be successfully applied to VLN

25 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Matterport3D Simulator: a large-scale visual reinforcement learning simulation environment built on the Matterport3D dataset

26 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room (R2R) Dataset
7,189 shortest paths between different (start, goal) pairs
3 associated navigation instructions per path
21,567 navigation instructions in total, with an average length of 29 words
400 AMT workers and 1,600 hours of annotation time

27 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room Dataset: distribution of instructions based on their first 4 words, read from the center outwards; arc length is proportional to each word's contribution

28 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Action space: left, right, up, down, forward, and stop, with a 30-degree turning angle
State and reward: similar to the previous method
Teacher forcing: an LSTM-based sequence-to-sequence architecture with an attention mechanism
Natural language instruction x = (x_1, x_2, ..., x_L); initial image observation o_0

29 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Language instruction encoding: h_i = LSTM_enc(x_i, h_{i-1}); h = (h_1, h_2, ..., h_L) is the encoder context used in the attention mechanism
Image and action embedding: a pretrained ResNet-152 extracts features of observation o_t; the image feature and the previous action are embedded as q_t, and h'_t = LSTM_dec(q_t, h'_{t-1})
Action prediction with attention mechanism: c_t = f(h'_t, h), ĥ_t = tanh(W_c [c_t; h'_t]), a_t = softmax(ĥ_t)
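A hedged PyTorch sketch of one decoder step implementing these equations with dot-product attention; the hidden sizes and the final projection of ĥ_t to action logits are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One step of the attention decoder sketched above (batch size 1)."""
    def __init__(self, embed_dim=256, hidden_dim=512, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)       # LSTM_dec
        self.w_c = nn.Linear(2 * hidden_dim, hidden_dim)     # W_c applied to [c_t; h'_t]
        self.out = nn.Linear(hidden_dim, n_actions)          # projects h~_t to action logits

    def forward(self, q_t, state, enc_h):
        # q_t: (1, embed_dim) embedding of the image feature and previous action
        # enc_h: (L, hidden_dim) encoder context h = (h_1, ..., h_L)
        h_dec, c_dec = self.lstm(q_t, state)                 # h'_t = LSTM_dec(q_t, h'_{t-1})
        scores = enc_h @ h_dec.squeeze(0)                    # dot-product attention f(h'_t, h)
        alpha = F.softmax(scores, dim=0)
        c_t = (alpha.unsqueeze(1) * enc_h).sum(dim=0)        # attention context c_t
        h_tilde = torch.tanh(self.w_c(torch.cat([c_t, h_dec.squeeze(0)], dim=-1)))
        a_t = F.softmax(self.out(h_tilde), dim=-1)           # distribution over actions a_t
        return a_t, (h_dec, c_dec)

dec = AttnDecoderStep()
enc_h = torch.randn(29, 512)                                 # encoder states for a 29-word instruction
state = (torch.zeros(1, 512), torch.zeros(1, 512))
a_t, state = dec(torch.randn(1, 256), state, enc_h)          # a_t has shape (6,)
```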

30 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Student forcing: an online version of DAgger, a reduction of imitation learning and structured prediction to no-regret online learning
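A small sketch of the difference between the two training regimes: in both cases the cross-entropy loss is computed against a ground-truth action, but they differ in which action is executed and fed back to the decoder at the next step (function and variable names are illustrative).

```python
import torch

def next_input_action(logits, gold_action, mode):
    """Action fed back into the decoder (and executed in the simulator) at the next step.
    Teacher forcing always follows the ground-truth action; student forcing, the online
    DAgger-style variant, samples from the agent's own output distribution so the agent
    learns to recover from its own mistakes."""
    if mode == "teacher":
        return gold_action
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()   # sample from the policy
```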

31 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments A sample output Blue discs indicate nearby (discretized) navigation options Instruction: Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.

32 Summary and Discussion
Recent research mostly focuses on map-less approaches and VLN, for example "Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation"

33 Summary and Discussion
Future work
More actions, such as interacting with items in the environment

34 Summary and Discussion
Future work
More actions, such as interacting with items in the environment
Augmented Reality

35 Thank you!

