Selective attention in RL B. Ravindran Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, Andy Barto
Features, features, everywhere! We inhabit a world rich in sensory input Focus on features relevant to task at hand Two questions: 1.How do you characterize relevance? 2.How do you identify relevant features?
Outline Characterization of relevance –MDP homomorphisms –Factored Representations –Relativized Options Identifying features –Option schemas –Deictic Options
Markov Decision Processes An MDP, M, is the tuple ⟨S, A, Ψ, P, R⟩: –S : set of states. –A : set of actions. –Ψ ⊆ S × A : set of admissible state-action pairs. –P : Ψ × S → [0, 1] : probability of transition. –R : Ψ → ℝ : expected reward. Policy π : S → A. Maximize total expected reward –optimal policy π*
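The tuple above can be made concrete with a tiny worked example. The following is a minimal sketch, assuming a toy two-state, two-action MDP (all transition probabilities and rewards below are illustrative, not from the talk); value iteration recovers the optimal policy π* that maximizes total expected reward.

```python
import numpy as np

# Toy MDP <S, A, Psi, P, R>: 2 states, 2 actions, all pairs admissible.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] = transition probability
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.8, 0.2]
P[1, 1] = [0.2, 0.8]
R = np.array([[0.0, 1.0],    # R[s, a] = expected reward
              [0.5, 0.0]])
gamma = 0.9

# Value iteration: iterate the Bellman optimality backup to a fixed point.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)        # greedy policy w.r.t. Q* is an optimal policy pi*
```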
Notion of Equivalence Find reduced models that preserve some aspects of the original model. (Gridworld figure with actions N, E, S, W.)
MDP Homomorphism A homomorphism h = ⟨f, {g_s}⟩ from M to M′ maps states via f : S → S′ and actions via state-dependent maps g_s, aggregating states whose transition dynamics and rewards are equivalent.
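The homomorphism conditions can be checked mechanically. A hedged sketch, assuming the standard definition from this line of work: the reward of each admissible pair must be preserved, and the transition probability into each image state must equal the total probability into its block of aggregated states.

```python
import numpy as np

def is_homomorphism(P, R, P2, R2, f, g, tol=1e-8):
    """Check whether (f, {g_s}) is an MDP homomorphism from (P, R) to (P2, R2).

    P[s, a, s'] : transition probabilities of the original MDP
    R[s, a]     : expected rewards of the original MDP
    P2, R2      : the same arrays for the candidate image MDP
    f[s]        : state map S -> S' (numpy int array)
    g[s, a]     : state-dependent action map, g_s(a)
    """
    n_states, n_actions, _ = P.shape
    n_states2 = P2.shape[0]
    for s in range(n_states):
        for a in range(n_actions):
            # Reward condition: R'(f(s), g_s(a)) = R(s, a)
            if abs(R2[f[s], g[s, a]] - R[s, a]) > tol:
                return False
            # Transition condition: P'(f(s), g_s(a), t) equals the total
            # probability of moving into the block f^{-1}(t)
            for t in range(n_states2):
                block_mass = P[s, a, f == t].sum()
                if abs(P2[f[s], g[s, a], t] - block_mass) > tol:
                    return False
    return True
```

For example, a symmetric two-state MDP whose states have identical dynamics collapses onto a one-state image under the aggregating map f = [0, 0].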
Example State-dependent action recoding. (Gridworld figure: actions N, E, S, W are relabelled across the diagonal.)
Some Theoretical Results Optimal value equivalence: If h = ⟨f, {g_s}⟩ is a homomorphism from M to M′, then Q*(s, a) = Q*(f(s), g_s(a)). Solve the homomorphic image and lift the policy to the original MDP. [generalizing those of Dean and Givan, 1997] Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.
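Policy lifting is the step that makes the theorem usable in practice. A small sketch (names illustrative): given an optimal policy π′ on the homomorphic image, any action in the preimage g_s⁻¹(π′(f(s))) is optimal in the original MDP.

```python
import numpy as np

def lift_policy(pi2, f, g):
    """Lift an image-MDP policy back to the original MDP.

    pi2[t]  : optimal action in image state t
    f[s]    : state map S -> S'
    g[s, a] : state-dependent action map g_s(a)
    """
    n_states, n_actions = g.shape
    pi = np.empty(n_states, dtype=int)
    for s in range(n_states):
        # Pick any a with g_s(a) = pi2(f(s)); such an a exists since g_s is onto.
        pi[s] = next(a for a in range(n_actions) if g[s, a] == pi2[f[s]])
    return pi
```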
More results Polynomial time algorithm to find reduced images Dean and Givan 97, Lee and Yannakakis 92, Ravindran 04 Approximate homomorphisms Ravindran, Barto 04 –Bounds for the loss in the optimal value function Soft homomorphisms Sorg, Singh 09 –Fuzzy notions of equivalence between two MDPs –Efficient algorithm for finding them Transfer learning (Soni, Singh et al 06), partial observability (Wolfe 10), etc.
Still more results Symmetries are special cases of homomorphisms Matthur Ravindran 08 –Finding symmetries is GI-complete –Harder than finding general reductions Efficient algorithms for constructing the reduced image Matthur Ravindran 07 –Factored MDPs –Polynomial in the size of the smaller MDP
Attention? How to use this concept for modeling attention? Combine with hierarchical RL –Look at sub-task specific relevance –Structured homomorphisms –Deixis (δεῖξις, "to point")
Factored MDPs State and action spaces defined as products of features/variables. Factor transition probabilities (2-slice temporal Bayes net). Exploit structure to define simple transformations.
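The factorization means the joint transition probability is a product of per-feature conditional factors, one per node of the 2-slice Bayes net. A minimal sketch, assuming a deliberately tiny DBN with two binary features whose next values each depend only on their own current value (all tables below are illustrative):

```python
# Per-feature conditional tables: P_i[a][x] = probability that feature i is 1
# at the next step, given action a and the current value x of its parent.
P0 = {0: [0.9, 0.2], 1: [0.5, 0.5]}   # next x0 depends on current x0
P1 = {0: [0.1, 0.8], 1: [0.5, 0.5]}   # next x1 depends on current x1

def transition_prob(s, a, s_next):
    """P(s' | s, a) as a product of per-feature factors, s = (x0, x1)."""
    p0 = P0[a][s[0]]
    p1 = P1[a][s[1]]
    prob0 = p0 if s_next[0] == 1 else 1 - p0
    prob1 = p1 if s_next[1] == 1 else 1 - p1
    return prob0 * prob1
```

The factored form stores 2 numbers per feature and action instead of a full 4x4 table per action, which is exactly the structural saving the slide refers to.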
Using Factored Representations Represent symmetry information in terms of features. –E.g., the NE–SW symmetry can be represented as a permutation of the state features together with a relabelling of the actions. Simple forms of transformations –Projections –Permutations
Hierarchical Reinforcement Learning Options framework (Sutton, Precup, & Singh, 1999): A generalization of actions to include temporally-extended courses of action. Example: robot docking –π : pre-defined controller –β : terminate when docked or charger not visible –I : all states in which charger is in sight
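The option triple ⟨I, π, β⟩ maps directly onto a small data structure. A hedged sketch of the docking example (the state encoding and controller below are placeholders, not the talk's actual implementation):

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]             # I: states where the option may start
    policy: Callable[[int], int]         # pi: pre-defined controller
    termination: Callable[[int], float]  # beta: probability of terminating in s

# Illustrative docking option: states 0..9 are "charger in sight",
# state 0 means "docked".
docking = Option(
    initiation_set=set(range(10)),                 # I: charger in sight
    policy=lambda s: 0,                            # placeholder controller
    termination=lambda s: 1.0 if s == 0 else 0.0,  # beta: stop when docked
)
```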
Sub-goal Options Gather all the red objects. Five options – one for each room. Implicitly represent option policy. Option MDPs related to each other.
Relativized Options Relativized options (Ravindran and Barto 03) –Spatial abstraction - MDP Homomorphisms –Temporal abstraction – options framework Abstract representation of a related family of sub-tasks –Each sub-task can be derived by applying well defined transformations
Relativized Options (Cont) Relativized option ⟨h, M′, I, β⟩: –h : option homomorphism –M′ : option MDP (image of h) –I : initiation set –β : termination criterion. (Architecture figure: percepts from the environment map to the reduced state; option actions map back to top-level actions.)
Rooms World Task Single relativized option – get-object-exit-room. Especially useful when learning option policy –Speed up –Knowledge transfer. Terminology: Iba 89. Related to parameterized sub-tasks (Dietterich 00, Andre and Russell 01, 02).
Option Schema Finding the right transformation? –Given a set of candidate transformations Option MDP and policy can be viewed as a policy schema (Schmidt 75) –Template of a policy –Acquire schema in a prototypical setting –Learn bindings of sensory inputs and actions to schema
Problem Formulation Given: –the option MDP of a relativized option –H, a family of transformations. Identify the option homomorphism. Formulate as a parameter estimation problem –One parameter for each sub-task, takes values from H –Samples: transitions observed while executing the option –Bayesian learning
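The Bayesian learning step can be sketched as a posterior update over the candidate transformations in H: each candidate's weight is multiplied by the likelihood it assigns to the observed transition (after projecting the sample through it into the option MDP), then the weights are renormalized. This is a minimal sketch of that update, with the likelihoods treated as given.

```python
import numpy as np

def bayes_update(weights, likelihoods):
    """One Bayesian step over candidate transformations H.

    weights[h]     : current weight (prior) of transformation h
    likelihoods[h] : P(observed transition | transformation h)
    """
    posterior = weights * likelihoods
    return posterior / posterior.sum()

# Illustrative: three candidates, uniform prior; the observed sample is
# most likely under candidate h = 1, so its weight grows.
w = np.array([1/3, 1/3, 1/3])
w = bayes_update(w, np.array([0.1, 0.8, 0.1]))
```

Repeated over many samples, the weight of the correct transformation converges toward 1, which is the behavior reported in the results slides.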
Complex Game World Symmetric option MDP. One delayer. 40 transformations –8 spatial transformations combined with 5 projections. Parameters of the option MDP differ from those in the rooms world task.
Results Speed of Convergence Learning the policy is more difficult than learning the correct transformation!
Results Transformation Weights in Room 4 Transformation 12 eventually converges to 1
Results Transformation Weights in Room 2 Weights oscillate a lot Some transformation dominates eventually –Changes from one run to another
Deictic Representation Making attention more explicit. Sense world via pointers – selective attention. Actions defined with respect to pointers (Agre 88) –Game domain Pengo. Pointers can be arbitrarily complex –ice-cube-next-to-me –robot-chasing-me. Move block to top of block.
Deixis and Abstraction Deictic pointers project states and actions onto some abstract state-action space Consistent Representation (Whitehead and Ballard 91) –states with same abstract representation have the same optimal value. –Lion algorithm, works with deterministic systems Extend relativized options to model deictic representation Ravindran Barto Mathew 07 –Factored MDPs –Restrict transformations available Only projections –Homomorphism conditions ensure consistent representation
Deictic Option Schema Deictic option schema: –O - A relativized option –K - A set of deictic pointers –D - A collection of sets of possible projections, one for each pointer Finding the correct transformation for a given state gives a consistent representation Use a factored version of a parameter estimation algorithm
Problem Formulation Given: –the option MDP of a relativized option. Identify the right pointer configuration for each sub-task. Formulate as a parameter estimation problem –One parameter for each set of connected pointers per sub-task –Takes values from the product of the corresponding projection sets in D –Samples: transitions observed while executing the option –Heuristic modification of Bayesian learning
Heuristic Update Rule Use a heuristic update rule: weight each candidate by the likelihood of the observed sample under it, with a floor on the likelihood so that individual incorrect samples cannot drive a candidate's weight to zero.
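The exact rule is not reproduced on the slide; the following is a plausible sketch, assuming a floored multiplicative update in which the likelihood is clipped from below at a small constant ν (an assumed parameter) so a single misleading sample cannot eliminate a candidate, addressing the susceptibility to incorrect samples noted earlier.

```python
import numpy as np

def heuristic_update(weights, likelihoods, nu=0.01):
    """Floored multiplicative weight update (sketch; nu is an assumed floor)."""
    posterior = weights * np.maximum(likelihoods, nu)  # clip likelihood at nu
    return posterior / posterior.sum()

# Illustrative: a sample with zero likelihood under candidate 0 no longer
# drives its weight to zero; it is merely suppressed.
w = np.array([0.5, 0.5])
w = heuristic_update(w, np.array([0.0, 0.9]))
```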
Game Domain 2 deictic pointers: delayer and retriever 2 fixed pointers: where-am-I and have-diamond 8 possible values for delayer and retriever –64 transformations
Experimental Setup Composite Agent –Uses 64 transformations and a single-component weight vector. Deictic Agent –Uses a 2-component weight vector –8 transformations per component. Hierarchical SMDP Q-learning.
Experimental Results – Speed of Convergence
Experimental Results – Timing Execution Time –Composite – 14 hours and 29 minutes –Factored – 4 hours and 16 minutes
Experimental Results – Delayer Weights Mean 52 Std. Dev
Experimental Results – Retriever Weights Mean 3045 Std. Dev
Robosoccer Hard to learn a policy for the entire game. Look at simpler problems –Keepaway –Half-field offence. Learn policy in a relative frame of reference. Keep changing the bindings.
Summary Richer representations needed for RL –Deictic representations Deictic Option Schemas –Combines ideas of hierarchical RL and MDP homomorphisms –Captures aspects of deictic representation
Future Build an integrated cognitive architecture that uses both bottom-up and top-down information –In perception –In decision making. Combine aspects of supervised, unsupervised, reinforcement learning and planning approaches.
Symmetry example –Towers of Hanoi (start and goal configurations). Such a transformation that preserves the system properties is an automorphism. The group of all automorphisms is known as the symmetry group of the system.
More Structure In general, NP-hard –Polynomial time algorithm for computing homomorphic image, under certain assumptions Dean and Givan 97, Lee and Yannakakis 92, Ravindran 04 State dependent action recoding –Greater reduction in problem size –Model symmetries Reflections, rotations, permutations
Symmetry A symmetric system is one that is invariant under certain transformations onto itself. –The gridworld in the earlier example is invariant under reflection along the diagonal. (Gridworld figure with actions N, E, S, W.)
Rooms World Task Train in room 1. 8 candidate spatial transformations –Reflections about the x and y axes and the x = y and x = −y lines –Rotations by integer multiples of 90 degrees.
Experimental Setup Relativized agents 1.Knows the right transformations 2.Chooses from 8 transformations Hierarchical SMDP Q-learning (Dietterich 00) –Q-learning at the lowest level (Watkins 89) –SMDP Q-learning at the higher level (Bradtke and Duff 95) –Simultaneous learning at all levels –Converges to recursively optimal policy
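The higher-level SMDP Q-learning update differs from one-step Q-learning only in discounting by γ^k over the k steps an option runs. A minimal sketch of that single update (array shapes and numbers are illustrative):

```python
import numpy as np

def smdp_q_update(Q, s, o, r, s_next, k, alpha=0.1, gamma=0.9):
    """SMDP Q-learning: option o ran k steps from s to s_next, earning
    cumulative discounted reward r along the way.

    Q(s, o) <- Q(s, o) + alpha * (r + gamma^k * max_o' Q(s_next, o') - Q(s, o))
    """
    target = r + gamma**k * Q[s_next].max()
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# One update on a 3-state, 2-option table.
Q = np.zeros((3, 2))
Q = smdp_q_update(Q, s=0, o=1, r=1.0, s_next=2, k=4)
```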
Results Speed of Convergence Not much of a difference, since the correct transformation is identified within 15 iterations.
Approximate Equivalence Exact homomorphisms rare in practice Relax equivalence criteria (Ravindran and Barto 02) Problem with Bayesian update –Use prototypical room as option MDP –Susceptible to incorrect samples
Heuristic Update Rule Use a heuristic update rule: weight each candidate by the likelihood of the observed sample under it, with a floor on the likelihood so that individual incorrect samples cannot drive a candidate's weight to zero.
Complex Game World Gather all 4 diamonds in the world. Benign and delayer robots. Agent unaware of the type of the robots.
Experimental Setup Regular agent –4 separate options Relativized agent –Uses option MDP shown earlier –Chooses from 40 transformations Room 2 has no right transformation Hierarchical SMDP Q-learning (Dietterich 00) –Q-learning at the lowest level (Watkins 89) –SMDP Q-learning at the higher level (Bradtke and Duff 95)