Visual Navigation Yukun Cui

Outline
- Introduction
- Technical Approaches
- Target-driven Visual Navigation in Indoor Scenes
- Vision-and-Language Navigation
- Summary and Discussion

Introduction
Navigation: How can I get to HUB Heritage Hall to check in?

Introduction
Outdoor Navigation: based on map and GPS

Introduction
Indoor Navigation: How can I find Heritage Hall inside the HUB?
- No detailed information about the inside of the building on Google Maps
- GPS is not accurate inside the building

Visual Navigation
Visual navigation can be roughly described as the process of determining a suitable and safe path from a starting point to a goal point for a robot or user travelling between them, based on visual observations.

Visual Navigation
Map-Based Navigation
- Uses a global map of the environment to make navigation decisions
- Main problem: self-localization (see the sketch below)
  1. Acquire image information
  2. Detect landmarks in the current view (edges, corners, objects)
  3. Match observed landmarks with those contained in the stored map according to certain criteria
  4. Update the robot position as a function of the matched landmarks' locations in the map
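A minimal sketch of this self-localization loop, with landmarks simplified to 2D points and a nearest-neighbor matching criterion (both illustrative assumptions, not any specific system's method):

```python
import math

# Steps 1-2 (image acquisition and landmark detection) are hardware- and
# detector-specific; this sketch covers steps 3-4 on 2D point landmarks.

def match_landmarks(observed, stored_map, max_dist=1.0):
    """Step 3: match each observed landmark to the closest stored one."""
    matches = []
    for ox, oy in observed:
        mx, my = min(stored_map, key=lambda m: math.dist((ox, oy), m))
        if math.dist((ox, oy), (mx, my)) <= max_dist:
            matches.append(((ox, oy), (mx, my)))
    return matches

def update_pose(pose, matches):
    """Step 4: shift the pose estimate by the mean observed-to-map offset."""
    if not matches:
        return pose
    dx = sum(mx - ox for (ox, _), (mx, _) in matches) / len(matches)
    dy = sum(my - oy for (_, oy), (_, my) in matches) / len(matches)
    return (pose[0] + dx, pose[1] + dy)
```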

Visual Navigation
Map-Building-Based Navigation
- Constructs a map for navigation
- Builds the map through a training phase guided by humans
- SLAM for the robot
- Video: "Navigation and Position" by InMotion

Visual Navigation
Map-less Navigation
- Movements depend on the elements observed in the environment
Vision-and-Language Navigation (VLN)
- Links natural language to vision and action in unstructured, previously unseen environments

Target-driven Visual Navigation
- Input: the current observation and an image of the target
- Output: an action in the 3D environment
- Advantages: generalizes to different scenes and targets; trained in a virtual environment and usable in real scenes

Target-driven Visual Navigation
AI2-THOR Framework
- Integrates the model with different types of environments
- Plug-and-play architecture: different types of scenes can be easily incorporated
- A detailed physics model of the scene
- Highly scalable
- Cheaper and safer training

Target-driven Visual Navigation
- The AI2-THOR framework integrates a physics engine (Unity 3D) with a deep learning framework (Tensorflow).
- Feedback from the environment can be used immediately for online decision making (see the sketch below).
- To generalize to real-world images, the scenes mimic the appearance of the real world as closely as possible.
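A minimal sketch of this online interaction loop using the ai2thor Python package; exact API names have varied across package versions, so treat this as illustrative:

```python
from ai2thor.controller import Controller

# Online agent-environment loop: each step immediately returns an event
# whose frame can feed the next decision.
controller = Controller(scene="FloorPlan1")      # a kitchen scene
for _ in range(10):
    event = controller.step(action="MoveAhead")  # feedback returned at once
    frame = event.frame                          # RGB observation (numpy array)
    # a policy could condition the next action on `frame` right here
controller.stop()
```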

Target-driven Visual Navigation
- 32 scenes in 4 common scene types: kitchen, living room, bedroom, and bathroom
- On average 68 objects per scene
- The framework can be used for more fine-grained physical interactions

Target-driven Visual Navigation
AI2-THOR framework vs. others: AI2-THOR is closer to the real world.

Target-driven Visual Navigation
Target-driven Navigation Model
- Deep reinforcement learning (DRL) models provide an end-to-end learning framework for transforming pixel information into actions.
- Standard DRL models aim at finding a direct mapping from state representations s to the policy π(s).
- The goal is hard-coded in the neural network parameters, so a change of goal requires updating the network parameters accordingly.

Target-driven Visual Navigation
Problem: lack of generalization
- When incorporating new targets, a new model has to be re-trained.
Solution: specify the task objective as an input to the model instead of implanting the target in the model parameters.
- The action a at time t is drawn from a ∼ π(s_t, g | u), where u are the model parameters.
- Actions are conditioned on both states and targets, so no re-training is required for new targets (see the sketch below).
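A minimal sketch of drawing an action from a target-conditioned policy, assuming PyTorch; `policy_net` is a hypothetical network standing in for the parameters u:

```python
import torch

# Target-conditioned policy pi(s_t, g | u): the target g is an input,
# so new targets need no re-training of the network.
def act(policy_net, state_feat, goal_feat):
    probs = policy_net(torch.cat([state_feat, goal_feat], dim=-1))
    return torch.distributions.Categorical(probs=probs).sample()  # a ~ pi
```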

Target-driven Visual Navigation
Target-driven Navigation Model
Action space
- Four actions: moving forward, moving backward, turning left, and turning right
- Step length of 0.5 meters and turning angle of 90 degrees
Observations and goals
- Both observations and goals are images taken by the agent's RGB camera in its first-person view.
- The task objective is to navigate to the location and viewpoint where the target image was taken.
Reward design: minimize the trajectory length to the navigation target (see the sketch below)
- Goal-reaching reward: 10.0
- A small time penalty (-0.01) as immediate reward
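A direct transcription of this reward design as a Python sketch:

```python
# Reward design from the slide: +10.0 when the target location/viewpoint
# is reached, otherwise a -0.01 time penalty per step.
GOAL_REWARD = 10.0
TIME_PENALTY = -0.01

def reward(reached_goal: bool) -> float:
    return GOAL_REWARD if reached_goal else TIME_PENALTY
```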

Target-driven Visual Navigation
- Targets across all scenes share the same generic siamese layers.
- Targets within a scene share the same scene-specific layer.

Target-driven Visual Navigation
Target-driven Navigation Model
- A3C: a reinforcement learning model that learns by running multiple copies of training threads in parallel and updating a shared set of model parameters asynchronously.
- Each thread runs with a different navigation target.
- Scene-specific layers: updated by gradients from the navigation tasks within that scene
- Generic siamese layers: updated by all targets (see the sketch below)
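A minimal sketch of this parameter-sharing scheme, assuming a PyTorch implementation with illustrative layer sizes (the paper's implementation is in Tensorflow, and the A3C value head is omitted for brevity):

```python
import torch
import torch.nn as nn

class TargetDrivenA3CModel(nn.Module):
    def __init__(self, scene_ids, feat_dim=2048, embed=512, n_actions=4):
        super().__init__()
        # generic siamese layer: shared across all scenes and targets
        self.siamese = nn.Linear(feat_dim, embed)
        # one scene-specific head per scene, updated only by that scene's tasks
        self.scene_heads = nn.ModuleDict({
            sid: nn.Sequential(nn.Linear(2 * embed, 512), nn.ReLU(),
                               nn.Linear(512, n_actions))
            for sid in scene_ids
        })

    def forward(self, obs_feat, goal_feat, scene_id):
        o = torch.relu(self.siamese(obs_feat))   # the same weights embed both
        g = torch.relu(self.siamese(goal_feat))  # observation and target
        return self.scene_heads[scene_id](torch.cat([o, g], dim=-1))
```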

Target-driven Visual Navigation
Target-driven Navigation Model
- Implemented in Tensorflow
- Trained on an Nvidia GeForce GTX Titan X GPU
- The target-driven model learns better navigation policies than the state-of-the-art A3C methods after 100M training frames.

Target-driven Visual Navigation
Generalizing to new targets within one scene
- All models are trained with 20M frames.
- Success rate: the percentage of trajectories shorter than 500 steps (see the snippet below)
- Consistent trend of increasing success rate as the number of trained targets increases
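The success-rate metric as a small Python sketch:

```python
# Fraction of evaluation episodes whose trajectory reaches the target
# in fewer than 500 steps.
def success_rate(trajectory_lengths, max_steps=500):
    return sum(1 for n in trajectory_lengths if n < max_steps) / len(trajectory_lengths)
```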

Target-driven Visual Navigation
Generalizing to new scenes
- Faster convergence as the number of trained scenes grows
- Transferring the generic layers improves data efficiency when learning in new environments, compared to training from scratch.

Target-driven Visual Navigation
Demo video: https://www.youtube.com/watch?v=SmBxMDiOrvs

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
- Investigates whether existing vision and language methods (e.g., VQA) can be successfully applied to VLN

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Matterport3D Simulator
- Large-scale visual reinforcement learning simulation environment
- Based on the Matterport3D dataset

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room (R2R) Dataset
- 7,189 shortest paths from different (start, goal) pairs
- 400 AMT workers and about 1,600 hours of annotation time
- 21,567 navigation instructions, average length 29 words
- 3 associated navigation instructions per path

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Room-to-Room Dataset
- Distribution of instructions based on their first 4 words, read from the center outwards
- Arc length is proportional to the word's contribution

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
- Action space: left, right, up, down, forward, and stop, with a 30-degree turning angle
- State and reward: similar to the previous method
Teacher-forcing
- LSTM-based sequence-to-sequence architecture with an attention mechanism
- Natural language instruction x = (x_1, x_2, …, x_L)
- Initial image observation o_0
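A minimal sketch of encoding the instruction tokens, assuming PyTorch; the vocabulary size, embedding size, and tokenization are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Instruction encoder: tokens -> embeddings -> LSTM states h_1..h_L.
vocab_size, embed_dim, hidden = 1000, 256, 512
embedding = nn.Embedding(vocab_size, embed_dim)
lstm_enc = nn.LSTM(embed_dim, hidden, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))  # a 12-word instruction x
enc_ctx, _ = lstm_enc(embedding(tokens))        # encoder context (h_1, ..., h_L)
```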

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Language instruction encoding
- h_i = LSTM_enc(x_i, h_{i-1})
- h̄ = {h_1, h_2, …, h_L}: the encoder context used in the attention mechanism
Image and action embedding
- A pretrained ResNet-152 extracts features of observation o_t
- The image feature and the previous action are embedded together as q_t
- h'_t = LSTM_dec(q_t, h'_{t-1})
Action prediction with the attention mechanism
- c_t = f(h'_t, h̄)
- h̃_t = tanh(W_c [c_t; h'_t])
- a_t = softmax(h̃_t)
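A minimal sketch of one decoder step implementing the equations above, assuming a PyTorch implementation, dot-product attention standing in for f, equal encoder/decoder hidden sizes, and an output projection onto the 6 actions (all illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    def __init__(self, q_dim=2048 + 32, hidden=512, n_actions=6):
        super().__init__()
        self.lstm_dec = nn.LSTMCell(q_dim, hidden)  # h'_t = LSTM_dec(q_t, h'_{t-1})
        self.W_c = nn.Linear(2 * hidden, hidden)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, q_t, state, enc_ctx):
        # q_t: (B, q_dim); state: (h'_{t-1}, cell); enc_ctx: (B, L, hidden)
        h_dec, cell = self.lstm_dec(q_t, state)
        scores = torch.bmm(enc_ctx, h_dec.unsqueeze(2)).squeeze(2)  # align with h_i
        alpha = F.softmax(scores, dim=1)
        c_t = torch.bmm(alpha.unsqueeze(1), enc_ctx).squeeze(1)     # c_t = f(h'_t, h)
        h_tilde = torch.tanh(self.W_c(torch.cat([c_t, h_dec], dim=1)))
        a_t = F.softmax(self.out(h_tilde), dim=1)                   # a_t = softmax(...)
        return a_t, (h_dec, cell)
```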

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Student-forcing
- An online version of DAgger (see the sketch below)
- DAgger: a reduction of imitation learning and structured prediction to no-regret online learning
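A minimal sketch of how student-forcing differs from teacher-forcing during training, assuming PyTorch: under both regimes the cross-entropy loss is computed against the ground-truth action; they differ only in which action the agent executes next.

```python
import torch

def next_action(logits, teacher_action, student_forcing):
    if student_forcing:
        # sample from the agent's own output distribution (online DAgger)
        return torch.distributions.Categorical(logits=logits).sample()
    return teacher_action  # teacher-forcing: always follow the ground truth
```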

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
A sample output
- Blue discs indicate nearby (discretized) navigation options
- Instruction: "Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall."

Summary and Discussion
- Recent research mostly focuses on map-less approaches and VLN
- Example: Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Summary and Discussion
Future work
- More actions, such as interacting with items in the environment
- Augmented reality

Thank you!