Presentation on theme: "Selective attention in RL" — Presentation transcript:
1 Selective attention in RL. B. Ravindran. Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, and Andy Barto.
2 Features, features, everywhere! We inhabit a world rich in sensory input. Focus on the features relevant to the task at hand.
3 Features, features, everywhere! We inhabit a world rich in sensory input. Focus on the features relevant to the task at hand. Two questions: How do you characterize relevance? How do you identify relevant features?
4 Outline. Characterization of relevance: MDP homomorphisms, factored representations, relativized options. Identifying features: option schemas, deictic options.
5 Markov Decision Processes. An MDP, M, is the tuple ⟨S, A, Ψ, P, R⟩: S, the set of states; A, the set of actions; Ψ ⊆ S × A, the set of admissible state-action pairs; P: Ψ × S → [0, 1], the transition probability; R: Ψ → ℝ, the expected reward. A policy is a map π: S → A; the goal is to maximize the total expected reward, yielding an optimal policy π*.
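For concreteness, a minimal sketch of this tuple in Python (the type and field names are illustrative assumptions, not code from the talk):

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State = int      # illustrative: states as integers
Action = str     # illustrative: actions as labels

@dataclass
class MDP:
    S: Set[State]                                # set of states
    A: Set[Action]                               # set of actions
    Psi: Set[Tuple[State, Action]]               # admissible state-action pairs
    P: Dict[Tuple[State, Action, State], float]  # transition probabilities
    R: Dict[Tuple[State, Action], float]         # expected rewards

# A deterministic policy maps each state to an admissible action; the goal
# is a policy maximizing the total expected reward.
Policy = Dict[State, Action]
```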
6 Notion of Equivalence. Find reduced models that preserve some aspects of the original model.
7 MDP Homomorphism. A homomorphism h = ⟨f, {g_s}⟩ from M = ⟨S, A, Ψ, P, R⟩ to M′ aggregates states via f and recodes actions via state-dependent maps g_s, such that P′(f(s), g_s(a), f(s′)) = Σ_{s″ ∈ f⁻¹(f(s′))} P(s, a, s″) and R′(f(s), g_s(a)) = R(s, a). (Note: for MDPs we drop the dependency on time for the transitions and the rewards.)
9 Some Theoretical Results [generalizing those of Dean and Givan, 1997]. Optimal value equivalence: if h(s₁, a₁) = h(s₂, a₂) then Q*(s₁, a₁) = Q*(s₂, a₂). Solve the homomorphic image and lift the policy to the original MDP. Theorem: if M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.
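A sketch of the lifting step, assuming the homomorphism is given as a state map f and state-dependent action maps with known preimages (the names are illustrative):

```python
# Lift a policy pi_image that is optimal in the homomorphic image M' back to
# the original MDP M, via h = <f, {g_s}>. g_inv[s][a'] returns some action a
# of M with g_s(a) = a'; any such preimage preserves optimality.
def lift_policy(pi_image, f, g_inv):
    def pi(s):
        abstract_action = pi_image[f(s)]   # act optimally in the image
        return g_inv[s][abstract_action]   # pull the action back to M
    return pi
```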
10 More Results. Polynomial-time algorithm to find reduced images (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04). Approximate homomorphisms (Ravindran and Barto ’04): bounds for the loss in the optimal value function. Soft homomorphisms (Sorg and Singh ’09): fuzzy notions of equivalence between two MDPs; efficient algorithm for finding them. Also transfer learning (Soni, Singh et al. ’06), partial observability (Wolfe ’10), etc.
11 Still More Results. Symmetries are special cases of homomorphisms (Matthur and Ravindran ’08): finding symmetries is GI-complete, harder than finding general reductions. Efficient algorithms for constructing the reduced image (Matthur and Ravindran ’07): for factored MDPs, polynomial in the size of the smaller MDP.
12 Attention? How to use this concept for modeling attention? Combine with hierarchical RL: look at sub-task-specific relevance. Structured homomorphisms. Deixis (δεῖξις, "to point").
13 Factored MDPs. 2-slice temporal Bayes net. State and action spaces defined as products of features/variables. Factor the transition probabilities. Exploit the structure to define simple transformations.
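A sketch of the factorization (the table layout is an assumption): the joint transition probability decomposes into per-feature conditionals over the 2-slice net's parent sets:

```python
# Factored transition model: P(s' | s, a) = prod_i P(s'_i | parents_i(s), a).
# parents[i] lists the indices of the features that feature i depends on;
# cpts[i] maps (parent_values, action, next_value) to a probability.
def factored_transition_prob(s, a, s_next, parents, cpts):
    p = 1.0
    for i, v_next in enumerate(s_next):
        parent_vals = tuple(s[j] for j in parents[i])
        p *= cpts[i][(parent_vals, a, v_next)]
    return p
```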
14 Using Factored Representations. Represent symmetry information in terms of features. E.g., the NE-SW symmetry can be represented as a permutation that swaps the two coordinate features. Simple forms of transformations: projections and permutations.
15 Hierarchical Reinforcement Learning. Options framework. Options (Sutton, Precup, and Singh, 1999): a generalization of actions to include temporally extended courses of action. An option is a tuple o = ⟨I, π, β⟩. Example, robot docking: π, a pre-defined controller; β, terminate when docked or when the charger is not visible; I, all states in which the charger is in sight.
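A minimal sketch of the option tuple for the docking example, with toy state predicates (the dictionary keys are assumptions):

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # e.g. {"charger_visible": True, "docked": False}

@dataclass
class Option:
    I: Callable[[State], bool]       # initiation set: can the option start in s?
    pi: Callable[[State], str]       # option policy (here a pre-defined controller)
    beta: Callable[[State], float]   # termination probability in s

# The robot-docking option from the slide, with placeholder predicates.
docking = Option(
    I=lambda s: s["charger_visible"],
    pi=lambda s: "move_toward_charger",
    beta=lambda s: 1.0 if s["docked"] or not s["charger_visible"] else 0.0,
)
```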
16 Sub-goal Options. Gather all the red objects. Five options, one for each room. Sub-goal options implicitly represent the option policy. The option MDPs are related to each other.
17 Relativized Options. Relativized options (Ravindran and Barto ’03) combine spatial abstraction (MDP homomorphisms) with temporal abstraction (the options framework): an abstract representation of a related family of sub-tasks, where each sub-task can be derived by applying well-defined transformations.
19 Rooms World Task. A single relativized option: get-object-exit-room. Especially useful when learning the option policy: speed-up and knowledge transfer. Terminology: Iba ’89. Related to parameterized sub-tasks (Dietterich ’00; Andre and Russell ’01, ’02).
20 Option Schema. How to find the right transformation, given a set of candidate transformations? The option MDP and policy can be viewed as a policy schema (Schmidt ’75): a template of a policy. Acquire the schema in a prototypical setting, then learn the bindings of sensory inputs and actions to the schema.
21 Problem Formulation. Given a relativized option and a family of transformations H, identify the option homomorphism. Formulate it as a parameter estimation problem: one parameter for each sub-task, taking values from H. Samples: observed transitions ⟨s, a, s′⟩. Use Bayesian learning; a sketch of the update follows.
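A sketch of the Bayesian update, assuming each candidate transformation h is a pair (f, g) and P_O is the option MDP's transition table (the names are illustrative):

```python
# Bayesian weight update over candidate transformations: each h maps the
# observed experience <s, a, s'> into the option MDP, whose transition model
# scores it; posterior weight is proportional to prior times likelihood.
def update_weights(weights, transforms, P_O, s, a, s_next):
    for h, (f, g) in transforms.items():
        likelihood = P_O.get((f(s), g(s, a), f(s_next)), 0.0)
        weights[h] *= likelihood
    total = sum(weights.values())
    if total > 0:
        for h in weights:
            weights[h] /= total    # renormalize to a distribution over H
    return weights
```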
23 Complex Game World. A symmetric option MDP with one delayer. 40 transformations: 8 spatial transformations combined with 5 projections. The parameters of the option MDP differ from those of the rooms world.
24 Results: Speed of Convergence. Learning the policy is more difficult than learning the correct transformation!
25 Results: Transformation Weights in Room 4. The weight of transformation 12 eventually converges to 1.
26 Results: Transformation Weights in Room 2. The weights oscillate a lot; some transformation dominates eventually, but which one changes from one run to another.
27 Deictic Representation. Example: move a block to the top of another block. Making attention more explicit: sense the world via pointers (selective attention), with actions defined with respect to the pointers (Agre ’88; game domain: Pengo). Pointers can be arbitrarily complex: ice-cube-next-to-me, robot-chasing-me.
28 Deixis and Abstraction. Deictic pointers project states and actions onto some abstract state-action space. Consistent representation (Whitehead and Ballard ’91): states with the same abstract representation have the same optimal value; their Lion algorithm works with deterministic systems. Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew ’07): factored MDPs; restrict the available transformations to projections only; the homomorphism conditions ensure a consistent representation.
29 Deictic Option Schema. A deictic option schema is a tuple ⟨O, K, D⟩: O, a relativized option; K, a set of deictic pointers; D, a collection of sets of possible projections, one for each pointer. Finding the correct transformation for a given state gives a consistent representation. Use a factored version of the parameter estimation algorithm.
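A minimal sketch of the tuple (field types are illustrative assumptions):

```python
# Deictic option schema <O, K, D>: a relativized option, a set of deictic
# pointers, and the candidate projections available to each pointer.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DeicticOptionSchema:
    O: object                     # the relativized option (option MDP + policy)
    K: List[str]                  # deictic pointers, e.g. ["delayer", "retriever"]
    D: Dict[str, List[object]]    # candidate projections for each pointer
```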
30 Classes of Pointers. Independent pointers; mutually dependent pointers.
31 Problem Formulation. Given the deictic option schema of a relativized option, identify the right pointer configuration for each sub-task. Formulate it as a parameter estimation problem: one parameter for each set of connected pointers per sub-task, taking values from the candidate projections for that set. Samples: observed transitions. Use a heuristic modification of Bayesian learning.
32 Heuristic Update Rule. Use a heuristic update rule in place of the exact Bayesian update; a sketch follows.
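The rule itself appeared as a formula on the slide and did not survive transcription. As one plausible sketch, motivated by the later slide on the Bayesian update's susceptibility to incorrect samples: floor the likelihood so a single bad sample cannot zero out a transformation's weight (the floor constant and form are assumptions):

```python
# Heuristic variant of the Bayesian update (assumed form): likelihoods are
# floored so that one zero-probability sample does not eliminate a candidate.
def heuristic_update(weights, transforms, P_O, s, a, s_next, floor=0.01):
    for h, (f, g) in transforms.items():
        likelihood = P_O.get((f(s), g(s, a), f(s_next)), 0.0)
        weights[h] *= max(likelihood, floor)   # floored likelihood (assumption)
    total = sum(weights.values())
    for h in weights:
        weights[h] /= total                    # renormalize
    return weights
```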
33 Game Domain. Two deictic pointers: delayer and retriever. Two fixed pointers: where-am-I and have-diamond. Eight possible values for the delayer and retriever; 64 transformations.
34 Experimental Setup. Composite agent: uses 64 transformations and a single-component weight vector. Deictic agent: uses a two-component weight vector, with 8 transformations per component. Both use hierarchical SMDP Q-learning.
49 Summary. Richer representations are needed for RL: deictic representations. Deictic option schemas combine ideas from hierarchical RL and MDP homomorphisms, and capture aspects of deictic representation.
50 Future. Build an integrated cognitive architecture that uses both bottom-up and top-down information, in perception and in decision making. Combine aspects of supervised, unsupervised, and reinforcement learning and planning approaches.
51 Symmetry Example: Towers of Hanoi. (Figure: start and goal configurations.) A transformation that preserves the system properties is an automorphism. The group of all automorphisms is known as the symmetry group of the system.
52 More Structure. In general, NP-hard. Polynomial-time algorithm for computing the homomorphic image, under certain assumptions (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04). State-dependent action recoding: greater reduction in problem size; models symmetries (reflections, rotations, permutations).
53 Symmetry. A symmetric system is one that is invariant under certain transformations onto itself. The gridworld in the earlier example is invariant under reflection along the diagonal, which also relabels the actions (N ↔ E, W ↔ S).
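As a concrete sketch of this symmetry as a homomorphism pair (f, g); the exact action pairing depends on which diagonal, and N ↔ E, W ↔ S is assumed here:

```python
# The diagonal reflection: states reflect by swapping coordinates, and the
# compass actions are relabeled accordingly.
def f(state):
    x, y = state
    return (y, x)

ACTION_SWAP = {"N": "E", "E": "N", "W": "S", "S": "W"}

def g(state, action):
    # the action recoding happens to be state-independent for this symmetry
    return ACTION_SWAP[action]
```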
54 Rooms World Task. Train in room 1. Eight candidate spatial transformations: reflections about the x and y axes and the x = y and x = −y lines, and rotations by integer multiples of 90 degrees.
55 Experimental Setup. Relativized agents: one knows the right transformation; the other chooses from the 8 transformations. Hierarchical SMDP Q-learning (Dietterich ’00): Q-learning at the lowest level (Watkins ’89) and SMDP Q-learning at the higher level (Bradtke and Duff ’95), learning simultaneously at all levels; converges to a recursively optimal policy.
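A sketch of the higher-level update in standard SMDP Q-learning form (the step size and discount values are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q-values over (state, option) pairs

def smdp_q_update(s, o, r, k, s_next, options, alpha=0.1, gamma=0.9):
    # After option o runs k steps from s and accrues discounted return r:
    # Q(s,o) <- Q(s,o) + alpha * (r + gamma^k * max_o' Q(s',o') - Q(s,o))
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (r + gamma ** k * best_next - Q[(s, o)])
```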
56 Results: Speed of Convergence. Not much of a difference, since the correct transformation is identified within 15 iterations.
57 Approximate Equivalence. Exact homomorphisms are rare in practice; relax the equivalence criteria (Ravindran and Barto ’02). Use a prototypical room as the option MDP. Problem with the Bayesian update: it is susceptible to incorrect samples.
59 Heuristic Update Rule. Use a heuristic update rule (as on slide 32).
60 Complex Game World. Gather all 4 diamonds in the world. Benign and delayer "robots"; the agent is unaware of the type of the robots.
61 Experimental Setup. Regular agent: 4 separate options. Relativized agent: uses the option MDP shown earlier and chooses from 40 transformations; room 2 has no right transformation. Hierarchical SMDP Q-learning (Dietterich ’00): Q-learning at the lowest level (Watkins ’89) and SMDP Q-learning at the higher level (Bradtke and Duff ’95).