1 Visual Navigation Yukun Cui

2 Outline
Introduction
Technical Approaches: Target-driven Visual Navigation in Indoor Scenes; Vision-and-Language Navigation
Summary and Discussion

3 Introduction Navigation: How can I get to HUB Heritage Hall to check in?

4 Introduction Outdoor Navigation Based on Map and GPS

5 Introduction Indoor Navigation: How can I find Heritage Hall inside the HUB?
1. Google Maps has no detailed information about the inside of the building
2. GPS is not accurate inside the building

6 Visual Navigation Visual navigation can be roughly described as the process of using visual observations to determine a suitable and safe path between a starting point and a goal point for a robot or user travelling between them.

7 Visual Navigation Map-Based Navigation
Use a global map of the environment to make navigation decisions. The main problem is self-localization:
1. Acquire image information
2. Detect landmarks in the current view (edges, corners, objects)
3. Match the observed landmarks with those contained in the stored map according to certain criteria
4. Update the robot position as a function of the matched landmarks' locations in the map
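To make step 4 concrete, here is a minimal, hypothetical sketch (not taken from either paper discussed later): assuming the robot's heading is known and landmark matching has already been done, each matched landmark votes for the robot position and the votes are averaged.

```python
import numpy as np

def localize(matches):
    """matches: list of (observed_rel_xy, map_global_xy) pairs for landmarks matched
    between the current view and the stored map. Assuming the robot's heading is known
    (observations expressed in the map frame's orientation), each match votes for the
    robot position as map position minus observed offset; the votes are averaged."""
    votes = [np.asarray(map_xy) - np.asarray(obs_xy) for obs_xy, map_xy in matches]
    return np.mean(votes, axis=0)

# Toy example: two landmarks observed 2 m ahead and 1 m to the left of the robot
matches = [((2.0, 0.0), (5.0, 3.0)),   # landmark A
           ((0.0, 1.0), (3.1, 4.0))]   # landmark B
print(localize(matches))               # estimated robot position, roughly [3.05 3.0]
```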

8 Visual Navigation Map-Building-Based Navigation
Construct a map of the environment and then use it for navigation
The map can be built through a training phase guided by humans, or by the robot itself via SLAM
Example: navigation and positioning by InMotion

9 Visual Navigation Map-less Navigation
Movements depend on the elements observed in the environment
Vision-and-Language Navigation (VLN): link natural language to vision and action in unstructured, previously unseen environments for navigation

10 Target-driven Visual Navigation
Input: the current observation and an image of the target
Output: an action in the 3D environment
Advantages: generalizes to different scenes and targets; trained in a virtual environment and usable in real scenes

11 Target-driven Visual Navigation
AI2-THOR Framework
Integrates the model with different types of environments
Plug-and-play architecture, so different types of scenes can be easily incorporated
A detailed model of the physics of the scene
Highly scalable
Cheaper and safer training

12 Target-driven Visual Navigation
The AI2-THOR framework is designed by integrating a physics engine (Unity 3D) with a deep learning framework (TensorFlow), so feedback from the environment can be used immediately for online decision making. To generalize to real-world images, the scenes mimic the appearance of the real world as closely as possible.
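For reference, a minimal interaction loop with the publicly released ai2thor Python package looks roughly like this; constructor arguments and action names have changed across releases, so treat it as a sketch rather than the exact setup used in the paper.

```python
from ai2thor.controller import Controller

# Start a kitchen scene; Unity runs the physics, Python receives the observations.
controller = Controller(scene="FloorPlan1")

for _ in range(4):
    event = controller.step(action="MoveAhead")      # one physics-backed step
    rgb = event.frame                                # first-person RGB frame (numpy array)
    ok = event.metadata["lastActionSuccess"]         # immediate feedback for online decisions
    print(rgb.shape, ok)

controller.step(action="RotateRight")
controller.stop()
```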

13 Target-driven Visual Navigation
32 scenes in 4 common scene types: kitchen, living room, bedroom, and bathroom, with about 68 object instances per scene. The framework can also be used for more fine-grained physical interactions.

14 Target-driven Visual Navigation
The AI2-THOR framework vs. other simulation frameworks: AI2-THOR is closer to the real world.

15 Target-driven Visual Navigation
Target-driven Navigation Model
Deep reinforcement learning (DRL) models provide an end-to-end learning framework for transforming pixel information into actions. Standard DRL models aim at finding a direct mapping from state representations s to the policy π(s), so the goal is hard-coded in the network parameters and any change of goal requires updating those parameters accordingly.

16 Target-driven Visual Navigation
Problem: lack of generalization. When incorporating new targets, a new model has to be re-trained.
Solution: the task objective is specified as an input to the model instead of implanting the target in the model parameters. The action a at time t is drawn from a ∼ π(s_t, g | u), so actions are conditioned on both states and targets, and no re-training is required for new targets.

17 Target-driven Visual Navigation
Target-driven Navigation Model
Action space: four actions (moving forward, moving backward, turning left, turning right), with a constant step length (0.5 meters) and turning angle (90 degrees)
Observations and goals: both are images taken by the agent's RGB camera in its first-person view; the task objective is to navigate to the location and viewpoint where the target image was taken
Reward design: minimize the trajectory length to the navigation target, using a goal-reaching reward (10.0) and a small time penalty (-0.01) as the immediate reward
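A direct transcription of this reward design into Python, with the constants taken from the slide (the goal test itself is left as an argument):

```python
GOAL_REWARD = 10.0     # given once the agent reaches the target location and viewpoint
TIME_PENALTY = -0.01   # small per-step penalty that encourages short trajectories

def step_reward(reached_goal: bool) -> float:
    """Immediate reward for one action under the reward design above."""
    return GOAL_REWARD if reached_goal else TIME_PENALTY
```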

18 Target-driven Visual Navigation
Targets across all scenes share the same generic siamese layers Targets within a scene share the same scene-specific layer

19 Target-driven Visual Navigation
Target-driven Navigation Model
A3C: a reinforcement learning model that learns by running multiple copies of training threads in parallel and updates a shared set of model parameters asynchronously
Each thread runs with a different navigation target
Scene-specific layers: updated by gradients from the navigation tasks within that scene
Generic siamese layers: updated by all targets
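Continuing the TargetDrivenNet sketch above (and assuming it is in scope), here is a quick check that a loss computed for one scene produces gradients only in the shared layers and that scene's head, which is the per-thread update pattern described on this slide:

```python
import torch

model = TargetDrivenNet(scene_ids=("kitchen", "bathroom"))
obs, target = torch.randn(1, 2048), torch.randn(1, 2048)

logits, value = model(obs, target, scene_id="kitchen")
(logits.sum() + value.sum()).backward()             # stand-in for the A3C loss

print(model.siamese[0].weight.grad is not None)                        # True: shared layers
print(model.scene_heads["kitchen"]["policy"].weight.grad is not None)  # True: this scene's head
print(model.scene_heads["bathroom"]["policy"].weight.grad)             # None: other scenes untouched
```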

20 Target-driven Visual Navigation
Target-driven Navigation Model
Implemented in TensorFlow and trained on an Nvidia GeForce GTX Titan X GPU
The target-driven model learns better navigation policies than the state-of-the-art A3C methods after 100M training frames

21 Target-driven Visual Navigation
Generalizing to new targets within one scene
All models are trained with 20M frames
Success rate: the percentage of trajectories shorter than 500 steps
There is a consistent trend of increasing success rate as the number of trained targets increases
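The success-rate metric on this slide, written out (the threshold and the example lengths are illustrative):

```python
def success_rate(trajectory_lengths, max_steps=500):
    """Fraction of evaluation episodes whose trajectory is shorter than max_steps."""
    successes = sum(1 for n in trajectory_lengths if n < max_steps)
    return successes / len(trajectory_lengths)

print(success_rate([120, 480, 500, 731]))   # 0.5
```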

22 Target-driven Visual Navigation
Generalizing to new scenes
Convergence is faster as the number of trained scenes grows
Transferring the generic layers improves data efficiency for learning in new environments, compared to training from scratch

23 Target-driven Visual Navigation

24 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments Existing vision and language methods (e.g. VQA) can be successfully applied to VLN

25 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Matterport3D Simulator: a large-scale visual reinforcement learning simulation environment built on the Matterport3D dataset

26 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room (R2R) Dataset
7,189 shortest paths between different (start, goal) pairs
3 associated navigation instructions per path
21,567 navigation instructions in total, with an average length of 29 words
400 AMT workers and 1,600 hours of annotation time

27 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room Dataset: distribution of instructions based on their first 4 words, read from the center outwards; arc length is proportional to each word's contribution

28 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Action space: left, right, up, down, forward, and stop, with a 30-degree turning angle
State and reward: similar to the previous method
Teacher forcing: an LSTM-based sequence-to-sequence architecture with an attention mechanism
Natural language instruction x = (x_1, x_2, ..., x_L); initial image observation o_0

29 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Language instruction encoding: h_i = LSTM_enc(x_i, h_{i-1}); h = (h_1, h_2, ..., h_L) is the encoder context used in the attention mechanism
Image and action embedding: a pretrained ResNet-152 extracts features of observation o_t; the image feature and the previous action are embedded as q_t, and h'_t = LSTM_dec(q_t, h'_{t-1})
Action prediction with attention mechanism: c_t = f(h'_t, h), ĥ_t = tanh(W_c [c_t; h'_t]), a_t = softmax(ĥ_t)
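A hedged PyTorch sketch of one decoder step implementing these equations with dot-product attention; the hidden sizes and the final projection of ĥ_t to action logits are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One step of the attention decoder sketched above (batch size 1)."""
    def __init__(self, embed_dim=256, hidden_dim=512, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)       # LSTM_dec
        self.w_c = nn.Linear(2 * hidden_dim, hidden_dim)     # W_c applied to [c_t; h'_t]
        self.out = nn.Linear(hidden_dim, n_actions)          # projects h~_t to action logits

    def forward(self, q_t, state, enc_h):
        # q_t: (1, embed_dim) embedding of the image feature and previous action
        # enc_h: (L, hidden_dim) encoder context h = (h_1, ..., h_L)
        h_dec, c_dec = self.lstm(q_t, state)                 # h'_t = LSTM_dec(q_t, h'_{t-1})
        scores = enc_h @ h_dec.squeeze(0)                    # dot-product attention f(h'_t, h)
        alpha = F.softmax(scores, dim=0)
        c_t = (alpha.unsqueeze(1) * enc_h).sum(dim=0)        # attention context c_t
        h_tilde = torch.tanh(self.w_c(torch.cat([c_t, h_dec.squeeze(0)], dim=-1)))
        a_t = F.softmax(self.out(h_tilde), dim=-1)           # distribution over actions a_t
        return a_t, (h_dec, c_dec)

dec = AttnDecoderStep()
enc_h = torch.randn(29, 512)                                 # encoder states for a 29-word instruction
state = (torch.zeros(1, 512), torch.zeros(1, 512))
a_t, state = dec(torch.randn(1, 256), state, enc_h)          # a_t has shape (6,)
```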

30 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Student forcing: an online version of DAgger, a reduction of imitation learning and structured prediction to no-regret online learning
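A small sketch of the difference between the two training regimes: in both cases the cross-entropy loss is computed against a ground-truth action, but they differ in which action is executed and fed back to the decoder at the next step (function and variable names are illustrative).

```python
import torch

def next_input_action(logits, gold_action, mode):
    """Action fed back into the decoder (and executed in the simulator) at the next step.
    Teacher forcing always follows the ground-truth action; student forcing, the online
    DAgger-style variant, samples from the agent's own output distribution so the agent
    learns to recover from its own mistakes."""
    if mode == "teacher":
        return gold_action
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()   # sample from the policy
```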

31 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments A sample output Blue discs indicate nearby (discretized) navigation options Instruction: Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.

32 Summary and Discussion
Recent research mostly focuses on map-less approaches and VLN, for example "Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation"

33 Summary and Discussion
Future work
More actions, such as interacting with items in the environment

34 Summary and Discussion
Future work
More actions, such as interacting with items in the environment
Augmented Reality

35 Thank you!

