Presentation on theme: "Selective attention in RL"— Presentation transcript:
1 Selective Attention in RL
B. Ravindran
Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, Andy Barto
2–3 Features, features, everywhere!
- We inhabit a world rich in sensory input
- Focus on the features relevant to the task at hand
- Two questions:
  - How do you characterize relevance?
  - How do you identify relevant features?
4 Outline
- Characterization of relevance
  - MDP homomorphisms
  - Factored representations
  - Relativized options
- Identifying features
  - Option schemas
  - Deictic options
5 Markov Decision Processes
An MDP M is the tuple ⟨S, A, Ψ, P, R⟩:
- S: set of states
- A: set of actions
- Ψ ⊆ S × A: set of admissible state-action pairs
- P: Ψ × S → [0, 1]: probability of transition
- R: Ψ → ℝ: expected reward
A policy π maps states to actions; the goal is to maximize the total expected reward, which is achieved by an optimal policy π*.
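The definition above can be made concrete with a minimal sketch (not from the slides): an MDP stored as plain dictionaries, solved by value iteration. The toy two-state chain and all names are illustrative assumptions.

```python
# Minimal MDP sketch: M = (S, A, Psi, P, R), with value iteration to
# compute the optimal value function. Psi is implicit in the keys of P/R.

def value_iteration(S, A, P, R, gamma=0.9, tol=1e-8):
    """P[(s, a)] is a dict {s_next: prob}; R[(s, a)] is the expected reward."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            # Bellman optimality backup over the admissible actions in s
            V_new[s] = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A if (s, a) in P)
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

# Toy 2-state chain: "go" moves 0 -> 1 (reward 1), "stay" loops (reward 0).
S, A = [0, 1], ["stay", "go"]
P = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {1: 1.0}}
R = {(0, "stay"): 0.0, (0, "go"): 1.0, (1, "stay"): 0.0, (1, "go"): 0.0}
V = value_iteration(S, A, P, R)  # V[0] = 1.0, V[1] = 0.0
```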
6 Notion of Equivalence
Find reduced models that preserve some aspects of the original model.
7 MDP Homomorphism
A homomorphism h from M to M′ maps state-action pairs, h(s, a) = (f(s), g_s(a)), so that transition probabilities and rewards commute with the aggregation of states.
(Speaker note: for MDPs we drop the dependency on time for the transitions and the rewards.)
9 Some Theoretical Results [generalizing those of Dean and Givan, 1997]
- Optimal value equivalence: if h is a homomorphism from M to M′, then V*(s) = V*(f(s)).
- Solve the homomorphic image and lift the policy to the original MDP.
- Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.
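The "solve the image, lift the policy" step can be sketched as follows. Given the state map f and the per-state action maps g_s, any action in the preimage of the image-optimal action works; the toy parity example is an illustrative assumption, not from the slides.

```python
# Lift a policy from a homomorphic image back to the original MDP:
# pick any original action a whose image g[s][a] equals the image policy's
# choice at f(s). By the theorem above, the lifted policy is optimal.

def lift_policy(pi_img, f, g):
    """pi_img: image-state -> image-action; f: state -> image-state;
    g[s]: dict mapping original actions to image actions in state s."""
    def pi(s):
        target = pi_img(f(s))
        for a, a_img in g[s].items():
            if a_img == target:
                return a
        raise ValueError("no action in state %r maps to %r" % (s, target))
    return pi

# Toy: states 0..3 fold onto {0, 1} by parity; action labels differ per state.
f = lambda s: s % 2
g = {0: {"L": "x", "R": "y"}, 1: {"L": "x", "R": "y"},
     2: {"L": "y", "R": "x"}, 3: {"L": "y", "R": "x"}}
pi_img = lambda s_img: "x" if s_img == 0 else "y"
pi = lift_policy(pi_img, f, g)  # pi(0) = "L", pi(2) = "R"
```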
10 More results
- Polynomial-time algorithm to find reduced images (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04)
- Approximate homomorphisms (Ravindran and Barto ’04)
  - Bounds for the loss in the optimal value function
- Soft homomorphisms (Sorg and Singh ’09)
  - Fuzzy notions of equivalence between two MDPs
  - Efficient algorithm for finding them
- Transfer learning (Soni, Singh et al. ’06), partial observability (Wolfe ’10), etc.
11 Still more results
- Symmetries are special cases of homomorphisms (Matthur and Ravindran ’08)
  - Finding symmetries is GI-complete
  - Harder than finding general reductions
- Efficient algorithms for constructing the reduced image (Matthur and Ravindran ’07)
  - Factored MDPs
  - Polynomial in the size of the smaller MDP
12 Attention?
- How to use this concept for modeling attention?
- Combine with hierarchical RL
  - Look at sub-task specific relevance
- Structured homomorphisms
- Deixis (δεῖξις, "to point")
13 Factored MDPs
- 2-slice temporal Bayes net
- State and action spaces defined as products of features/variables
- Factor the transition probabilities
- Exploit structure to define simple transformations
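The factoring above means the joint transition probability is a product of per-feature conditionals, P(s′ | s, a) = ∏_i P_i(s′_i | parents_i(s), a). A small sketch, with an illustrative two-feature toy model (the table layout is our assumption):

```python
# Factored (2-slice) transition model: each next-state feature depends only
# on its parent features in the current state, so the joint probability is a
# product of small conditional probability tables (CPTs).

def factored_transition_prob(s, a, s_next, cpts, parents):
    """cpts[i][(parent_vals, a)][v] = P(next feature i = v).
    parents[i] lists which features of s feature i depends on."""
    p = 1.0
    for i, v in enumerate(s_next):
        parent_vals = tuple(s[j] for j in parents[i])
        # Table size is exponential only in |parents[i]|, not in |S|.
        p *= cpts[i][(parent_vals, a)].get(v, 0.0)
    return p

# Toy: two binary features; each next feature depends only on feature 0.
parents = {0: [0], 1: [0]}
cpts = {0: {((0,), "a"): {0: 0.2, 1: 0.8}, ((1,), "a"): {1: 1.0}},
        1: {((0,), "a"): {0: 0.5, 1: 0.5}, ((1,), "a"): {1: 1.0}}}
p = factored_transition_prob((0, 0), "a", (1, 1), cpts, parents)  # 0.8 * 0.5
```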
14 Using Factored Representations
- Represent symmetry information in terms of features
  - E.g., the NE-SW symmetry can be represented as a permutation of the state features
- Simple forms of transformations:
  - Projections
  - Permutations
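Expressed on features, the NE-SW gridworld symmetry is just a swap of the x and y coordinates together with a relabeling of the actions. A sketch, where the compass action names are our assumption:

```python
# NE-SW diagonal symmetry as a feature permutation: reflecting the gridworld
# about the x = y diagonal swaps the two state features and exchanges the
# actions N <-> E and S <-> W.

def reflect_state(s):
    x, y = s
    return (y, x)  # permutation of the two state features

ACTION_MAP = {"N": "E", "E": "N", "S": "W", "W": "S"}

def reflect_action(a):
    return ACTION_MAP[a]
```

Note that applying the reflection twice returns the original state and action, as a symmetry transformation must.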
15 Hierarchical Reinforcement Learning: Options framework
- Options (Sutton, Precup, and Singh, 1999): a generalization of actions to include temporally-extended courses of action
- An option is a triple o = ⟨I, π, β⟩. Example: robot docking
  - π: pre-defined controller
  - β: terminate when docked or charger not visible
  - I: all states in which the charger is in sight
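The option triple can be sketched directly in code. The state encoding and predicate names for the docking example below are illustrative assumptions:

```python
# An option o = (I, pi, beta): initiation set, option policy, and
# termination condition, per Sutton, Precup, and Singh (1999).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    I: Callable[[object], bool]      # can the option be initiated in s?
    pi: Callable[[object], str]      # the temporally-extended policy
    beta: Callable[[object], float]  # probability of terminating in s

# Toy docking option; a state is a pair (charger_visible, docked).
dock = Option(
    I=lambda s: s[0],                 # charger in sight
    pi=lambda s: "approach",          # stand-in for the docking controller
    beta=lambda s: 1.0 if (s[1] or not s[0]) else 0.0,  # docked or lost sight
)
```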
16 Sub-goal Options
- Gather all the red objects
- Five options, one for each room
- Sub-goal options implicitly represent the option policy
- The option MDPs are related to each other
17 Relativized Options
- Relativized options (Ravindran and Barto ’03)
  - Spatial abstraction: MDP homomorphisms
  - Temporal abstraction: options framework
- Abstract representation of a related family of sub-tasks
- Each sub-task can be derived by applying well-defined transformations
19 Rooms World Task
- Single relativized option: get-object-exit-room
- Especially useful when learning the option policy
  - Speed-up
  - Knowledge transfer
- Terminology: Iba ’89
- Related to parameterized sub-tasks (Dietterich ’00; Andre and Russell ’01, ’02)
20 Option Schema
- How to find the right transformation, given a set of candidate transformations?
- The option MDP and policy can be viewed as a policy schema (Schmidt ’75)
  - Template of a policy
- Acquire the schema in a prototypical setting
- Learn bindings of sensory inputs and actions to the schema
21 Problem Formulation
- Given: a relativized option and a family of transformations H
- Identify the option homomorphism
- Formulate as a parameter estimation problem
  - One parameter for each sub-task, taking values from H
- Samples: transitions observed while executing the option
- Bayesian learning
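The Bayesian-learning step can be sketched as maintaining a posterior weight over the candidate transformations and updating it from each observed transition, using the option MDP's model as the likelihood. The toy swap/identity example is an illustrative assumption:

```python
# Bayesian update over candidate transformations h in H: each h is a pair
# (f, g) of state and action maps; the likelihood of a sample (s, a, s')
# under h is the option MDP's probability of the transformed transition.

def bayes_update(weights, sample, H, option_model):
    """weights: {h: prior weight}; sample: (s, a, s_next);
    option_model(s_img, a_img, s_next_img) -> transition probability."""
    s, a, s_next = sample
    new = {}
    for h, (f, g) in H.items():
        likelihood = option_model(f(s), g(a), f(s_next))
        new[h] = weights[h] * likelihood
    z = sum(new.values()) or 1.0  # avoid division by zero if all vanish
    return {h: w / z for h, w in new.items()}

# Toy: states {0, 1}; the option MDP moves 0 -> 1 under action "a".
option_model = lambda s, a, s2: 1.0 if (s == 0 and a == "a" and s2 == 1) else 0.0
H = {"identity": (lambda s: s, lambda a: a),
     "swap": (lambda s: 1 - s, lambda a: a)}
weights = {"identity": 0.5, "swap": 0.5}
# The observed transition 1 -> 0 only matches the model after swapping states.
weights = bayes_update(weights, (1, "a", 0), H, option_model)
```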
23 Complex Game World
- Symmetric option MDP
- One delayer
- 40 transformations: 8 spatial transformations combined with 5 projections
- Parameters of the option MDP differ from the rooms task
24 Results: Speed of Convergence
Learning the policy is more difficult than learning the correct transformation!
25 Results: Transformation Weights in Room 4
Transformation 12 eventually converges to weight 1.
26 Results: Transformation Weights in Room 2
- Weights oscillate a lot
- Some transformation eventually dominates
- Which one changes from one run to another
27 Deictic Representation
- "Move block to top of block": making attention more explicit
- Sense the world via pointers: selective attention
- Actions defined with respect to pointers (Agre ’88)
- Game domain: Pengo
- Pointers can be arbitrarily complex
  - ice-cube-next-to-me
  - robot-chasing-me
28 Deixis and Abstraction
- Deictic pointers project states and actions onto some abstract state-action space
- Consistent representation (Whitehead and Ballard ’91): states with the same abstract representation have the same optimal value
  - Lion algorithm; works with deterministic systems
- Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew ’07)
  - Factored MDPs
  - Restrict the transformations available: only projections
  - Homomorphism conditions ensure a consistent representation
29 Deictic Option Schema
- A deictic option schema consists of:
  - O: a relativized option
  - K: a set of deictic pointers
  - D: a collection of sets of possible projections, one for each pointer
- Finding the correct transformation for a given state gives a consistent representation
- Use a factored version of a parameter estimation algorithm
30 Classes of Pointers
- Independent pointers
- Mutually dependent pointers
31 Problem Formulation
- Given: the deictic option schema of a relativized option
- Identify the right pointer configuration for each sub-task
- Formulate as a parameter estimation problem
  - One parameter for each set of connected pointers per sub-task
  - Each parameter takes values from the corresponding set of projections
- Samples: transitions observed while executing the option
- Heuristic modification of Bayesian learning
32 Heuristic Update Rule
Use a heuristic update rule:
33 Game Domain
- 2 deictic pointers: delayer and retriever
- 2 fixed pointers: where-am-I and have-diamond
- 8 possible values each for the delayer and the retriever
- 64 transformations
34 Experimental Setup
- Composite agent
  - Uses 64 transformations and a single-component weight vector
- Deictic agent
  - Uses a 2-component weight vector, 8 transformations per component
- Hierarchical SMDP Q-learning
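The saving from the deictic factoring above can be sketched as follows: with independent pointers, the agent keeps one weight vector per pointer and scores a joint transformation as the product of its components. The uniform starting weights are illustrative:

```python
# Composite vs. deictic bookkeeping: the composite agent stores one weight
# per joint transformation (8 x 8 = 64), while the deictic agent stores a
# separate 8-entry vector per pointer (2 x 8 = 16 weights) and recovers a
# joint weight as a product over components.

def joint_weight(components, choice):
    """components: per-pointer weight dicts; choice: one value per pointer."""
    w = 1.0
    for comp, c in zip(components, choice):
        w *= comp[c]
    return w

delayer = {i: 1.0 / 8 for i in range(8)}    # 8 candidate bindings
retriever = {i: 1.0 / 8 for i in range(8)}  # 8 candidate bindings
w = joint_weight([delayer, retriever], (3, 5))  # uniform: 1/8 * 1/8 = 1/64
```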
49 Summary
- Richer representations are needed for RL
  - Deictic representations
- Deictic option schemas
  - Combine ideas from hierarchical RL and MDP homomorphisms
  - Capture aspects of deictic representation
50 Future
- Build an integrated cognitive architecture that uses both bottom-up and top-down information
  - In perception
  - In decision making
- Combine aspects of supervised, unsupervised, and reinforcement learning, and planning approaches
51 Symmetry example: Towers of Hanoi
[Figure: start and goal configurations]
- A transformation that preserves the system properties is an automorphism.
- The group of all automorphisms is known as the symmetry group of the system.
52 More Structure
- In general, NP-hard
- Polynomial-time algorithm for computing the homomorphic image, under certain assumptions (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04)
- State-dependent action recoding
  - Greater reduction in problem size
  - Models symmetries: reflections, rotations, permutations
53 Symmetry
- A symmetric system is one that is invariant under certain transformations onto itself.
- The gridworld in the earlier example is invariant under reflection along the diagonal.
54 Rooms World Task
- Train in room 1
- 8 candidate spatial transformations
  - Reflections about the x and y axes and the x = y and x = −y lines
  - Rotations by integer multiples of 90 degrees
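The 8 candidate transformations listed above are the symmetries of a square (the dihedral group D4). A sketch on an n × n room, where the grid size and the 0-indexed coordinate convention are our assumptions:

```python
# The 8 spatial transformations of an n x n grid: 4 rotations and 4
# reflections (about the x axis, the y axis, and the two diagonals).

def d4_transforms(n):
    """Return the 8 maps (x, y) -> (x', y') on a 0-indexed n x n grid."""
    m = n - 1
    return [
        lambda x, y: (x, y),          # rotation by 0 (identity)
        lambda x, y: (y, m - x),      # rotation by 90 degrees
        lambda x, y: (m - x, m - y),  # rotation by 180 degrees
        lambda x, y: (m - y, x),      # rotation by 270 degrees
        lambda x, y: (m - x, y),      # reflection about the y axis
        lambda x, y: (x, m - y),      # reflection about the x axis
        lambda x, y: (y, x),          # reflection about the x = y line
        lambda x, y: (m - y, m - x),  # reflection about the x = -y line
    ]

transforms = d4_transforms(3)  # the 8 candidates for a 3 x 3 room
```

Each map is a bijection on the grid, so every candidate is a valid relabeling of states.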
55 Experimental Setup
- Relativized agents
  - One knows the right transformation; the other chooses from 8 transformations
- Hierarchical SMDP Q-learning (Dietterich ’00)
  - Q-learning at the lowest level (Watkins ’89)
  - SMDP Q-learning at the higher level (Bradtke and Duff ’95)
  - Simultaneous learning at all levels
  - Converges to a recursively optimal policy
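The higher-level SMDP Q-learning update can be sketched as follows: after an option runs for k steps and accrues discounted return r, the target discounts the successor value by gamma^k. Variable names and the toy numbers are ours:

```python
# One SMDP Q-learning update over options: Q(s, o) moves toward
# r + gamma^k * max_o' Q(s', o'), where k is the option's duration and r
# the discounted reward accumulated while it ran (Bradtke and Duff '95).

def smdp_q_update(Q, s, o, r, k, s_next, options, alpha=0.1, gamma=0.9):
    """Q: dict keyed by (state, option); returns the updated table."""
    target = r + (gamma ** k) * max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q

# Toy update from an all-zero table.
options = ["o1", "o2"]
Q = {(s, o): 0.0 for s in ["s0", "s1"] for o in options}
Q = smdp_q_update(Q, "s0", "o1", r=1.0, k=2, s_next="s1", options=options)
# Q[("s0", "o1")] = 0 + 0.1 * (1.0 + 0.81 * 0 - 0) = 0.1
```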
56 Results: Speed of Convergence
Not much of a difference, since the correct transformation is identified within 15 iterations.
57 Approximate Equivalence
- Exact homomorphisms are rare in practice
- Relax the equivalence criteria (Ravindran and Barto ’02)
- Problem with the Bayesian update
  - Uses the prototypical room as the option MDP
  - Susceptible to incorrect samples
59 Heuristic Update Rule
Use a heuristic update rule:
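The rule's equation did not survive transcription. A plausible reading, motivated by the stated problem that the plain Bayesian update is susceptible to incorrect samples, is to floor each likelihood at a small epsilon so that no transformation's weight can be driven to zero by a single bad sample. This reconstruction is an assumption, not the slide's exact rule:

```python
# Heuristic variant of the Bayesian weight update: clip each likelihood from
# below at eps before the multiplicative update, so incorrect samples cannot
# permanently eliminate a candidate transformation. eps is an assumed knob.

def heuristic_update(weights, likelihoods, eps=0.01):
    new = {h: weights[h] * max(likelihoods[h], eps) for h in weights}
    z = sum(new.values())
    return {h: w / z for h, w in new.items()}

# A zero-likelihood sample now shrinks, rather than kills, a candidate.
weights = heuristic_update({"h1": 0.5, "h2": 0.5}, {"h1": 0.0, "h2": 1.0})
```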
60 Complex Game World
- Gather all 4 diamonds in the world
- Benign and delayer "robots"
- Agent unaware of the type of the robots
61 Experimental Setup
- Regular agent
  - 4 separate options
- Relativized agent
  - Uses the option MDP shown earlier
  - Chooses from 40 transformations
  - Room 2 has no right transformation
- Hierarchical SMDP Q-learning (Dietterich ’00)
  - Q-learning at the lowest level (Watkins ’89)
  - SMDP Q-learning at the higher level (Bradtke and Duff ’95)