1
**Selective attention in RL**

B. Ravindran. Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, and Andy Barto

3
**Features, features, everywhere!**

- We inhabit a world rich in sensory input
- Focus on features relevant to the task at hand
- Two questions:
  - How do you characterize relevance?
  - How do you identify relevant features?

4
**Outline**

- Characterization of relevance
  - MDP homomorphisms
  - Factored representations
  - Relativized options
- Identifying features
  - Option schemas
  - Deictic options

5
**Markov Decision Processes**

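An MDP, $M$, is the tuple $\langle S, A, \Psi, P, R \rangle$:
- $S$: set of states
- $A$: set of actions
- $\Psi \subseteq S \times A$: set of admissible state-action pairs
- $P : \Psi \times S \to [0, 1]$: probability of transition
- $R : \Psi \to \mathbb{R}$: expected reward

Policy $\pi$: maximize total expected reward; a policy that does so is an optimal policy $\pi^*$.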
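
As a rough illustration of the definition above, the sketch below stores the tuple in a small Python class and computes an optimal policy by value iteration. The class name `MDP`, the toy two-state problem, and the discount factor `gamma` are assumptions introduced here for illustration; the slide itself only speaks of maximizing total expected reward.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    admissible: set   # Psi: admissible (state, action) pairs
    P: dict           # P[(s, a)] -> {next_state: probability}
    R: dict           # R[(s, a)] -> expected reward

def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Return (V, greedy policy). gamma is an assumed discount factor.

    Assumes every state has at least one admissible action.
    """
    def q(s, a, V):
        return mdp.R[(s, a)] + gamma * sum(p * V[t] for t, p in mdp.P[(s, a)].items())

    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(q(s, a, V) for a in mdp.actions if (s, a) in mdp.admissible)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max((a for a in mdp.actions if (s, a) in mdp.admissible),
               key=lambda a: q(s, a, V))
        for s in mdp.states
    }
    return V, policy

# Hypothetical toy problem: "go" from state "a" earns 1 and reaches absorbing "b".
toy = MDP(
    states=["a", "b"],
    actions=["go", "stay"],
    admissible={("a", "go"), ("a", "stay"), ("b", "stay")},
    P={("a", "go"): {"b": 1.0}, ("a", "stay"): {"a": 1.0}, ("b", "stay"): {"b": 1.0}},
    R={("a", "go"): 1.0, ("a", "stay"): 0.0, ("b", "stay"): 0.0},
)
V, policy = value_iteration(toy)
```

Note that `admissible` plays the role of $\Psi$: the action "go" is simply not available in state "b".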

6
**Notion of Equivalence**

Find reduced models that preserve some aspects of the original model.

7
**MDP Homomorphism**

[Figure: a homomorphism h maps the original MDP onto an aggregated image.] For MDPs, we drop the dependence on time in the transitions and the rewards.

8
**Example**

[Figure: gridworld with actions N, E, S, W.] State-dependent action recoding.

9
**Some Theoretical Results**

[Generalizing those of Dean and Givan, 1997]
- Optimal value equivalence: if $h(s, a) = (s', a')$, then $Q^*(s, a) = Q^*(s', a')$.
- Solve the homomorphic image and lift the policy to the original MDP.
- Theorem: if $M'$ is a homomorphic image of $M$, then a policy optimal in $M'$ induces an optimal policy in $M$.
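
To make the idea of a homomorphic image and policy lifting concrete, here is a minimal sketch for small tabular MDPs. It checks the usual reward and block-transition conditions for a state map `f` and a state-dependent action map `g`, then lifts an image policy back to the original states. The function names, the dictionary layout, and the three-state toy problem are all assumptions made for illustration, and the check does not verify that the map is onto.

```python
def is_homomorphism(P, R, P_img, R_img, f, g, tol=1e-9):
    """Check that (f, g) maps the MDP (P, R) into the image (P_img, R_img).

    f[s] is the image state; g[(s, a)] is the state-dependent image action.
    """
    for (s, a), next_dist in P.items():
        img_sa = (f[s], g[(s, a)])
        # Reward condition: R'(f(s), g_s(a)) == R(s, a)
        if abs(R[(s, a)] - R_img[img_sa]) > tol:
            return False
        # Transition condition: probability mass summed per image state must match
        mass = {}
        for t, p in next_dist.items():
            mass[f[t]] = mass.get(f[t], 0.0) + p
        for t_img in set(mass) | set(P_img[img_sa]):
            if abs(mass.get(t_img, 0.0) - P_img[img_sa].get(t_img, 0.0)) > tol:
                return False
    return True

def lift_policy(policy_img, f, g, states, actions):
    """In each state, pick an action whose image is the image policy's choice."""
    lifted = {}
    for s in states:
        target = policy_img[f[s]]
        lifted[s] = next(a for a in actions if g.get((s, a)) == target)
    return lifted

# Hypothetical toy problem: "L" and "R" are mirror images around the goal "G".
states, actions = ["L", "R", "G"], ["left", "right"]
P = {("L", "right"): {"G": 1.0}, ("L", "left"): {"L": 1.0},
     ("R", "left"): {"G": 1.0}, ("R", "right"): {"R": 1.0},
     ("G", "left"): {"G": 1.0}, ("G", "right"): {"G": 1.0}}
R = {("L", "right"): 1.0, ("L", "left"): 0.0, ("R", "left"): 1.0,
     ("R", "right"): 0.0, ("G", "left"): 0.0, ("G", "right"): 0.0}

# State map and state-dependent action recoding into the reduced model.
f = {"L": "X", "R": "X", "G": "G"}
g = {("L", "right"): "toward", ("L", "left"): "away",
     ("R", "left"): "toward", ("R", "right"): "away",
     ("G", "left"): "away", ("G", "right"): "toward"}

P_img = {("X", "toward"): {"G": 1.0}, ("X", "away"): {"X": 1.0},
         ("G", "toward"): {"G": 1.0}, ("G", "away"): {"G": 1.0}}
R_img = {("X", "toward"): 1.0, ("X", "away"): 0.0,
         ("G", "toward"): 0.0, ("G", "away"): 0.0}

ok = is_homomorphism(P, R, P_img, R_img, f, g)
lifted = lift_policy({"X": "toward", "G": "toward"}, f, g, states, actions)
```

In this toy case the single image action "toward" lifts to "right" in state "L" and to "left" in state "R", which is what state-dependent action recoding buys.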

10
**More results**

- Polynomial-time algorithm to find reduced images (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04)
- Approximate homomorphisms (Ravindran and Barto ’04)
  - Bounds for the loss in the optimal value function
- Soft homomorphisms (Sorg and Singh ’09)
  - Fuzzy notions of equivalence between two MDPs
  - Efficient algorithm for finding them
- Transfer learning (Soni, Singh et al. ’06), partial observability (Wolfe ’10), etc.

11
**Still more results**

- Symmetries are special cases of homomorphisms (Matthur and Ravindran ’08)
  - Finding symmetries is GI-complete
  - Harder than finding general reductions
- Efficient algorithms for constructing the reduced image (Matthur and Ravindran ’07)
  - Factored MDPs
  - Polynomial in the size of the smaller MDP

12
**Attention?**

How to use this concept for modeling attention?
- Combine with hierarchical RL
  - Look at sub-task specific relevance
- Structured homomorphisms
- Deixis (δεῖξις, "to point")

13
**Factored MDPs**

[Figure: 2-slice temporal Bayes net.]

- State and action spaces defined as products of features/variables.
- Factor transition probabilities.
- Exploit structure to define simple transformations.

14
**Using Factored Representations**

- Represent symmetry information in terms of features.
- For example, the NE-SW symmetry can be represented as a transformation of the state features and action labels.
- Simple forms of transformations:
  - Projections
  - Permutations
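
The sketch below shows one way such feature-level transformations might look in code. It assumes a square gridworld with state features `(x, y)`, and it assumes that the NE-SW symmetry is the reflection that swaps the two coordinates while exchanging N with E and S with W; the slide's own formula is not recoverable, so treat this as an illustration rather than the original notation. The helper `project` is likewise a hypothetical example of a projection.

```python
N = 5  # side length of a hypothetical square gridworld
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(state, action):
    """Deterministic move, clipped at the walls."""
    x, y = state
    dx, dy = MOVES[action]
    return (min(max(x + dx, 0), N - 1), min(max(y + dy, 0), N - 1))

def reflect_state(state):
    """Permutation of the two state features: (x, y) -> (y, x)."""
    x, y = state
    return (y, x)

# Matching permutation of the action labels (assumed form of the symmetry).
ACTION_SWAP = {"N": "E", "E": "N", "S": "W", "W": "S"}

def project(state, keep):
    """Projection: keep only the features whose indices are listed in `keep`."""
    return tuple(state[i] for i in keep)

def is_symmetry():
    """True if transforming then stepping equals stepping then transforming."""
    return all(
        step(reflect_state((x, y)), ACTION_SWAP[a]) == reflect_state(step((x, y), a))
        for x in range(N) for y in range(N) for a in MOVES
    )
```

Calling `is_symmetry()` checks every state-action pair of the toy grid, which is feasible only because the example is tiny.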

15
**Hierarchical Reinforcement Learning: Options Framework**

Options (Sutton, Precup, & Singh, 1999): a generalization of actions to include temporally extended courses of action.

Example: robot docking, with option $o = \langle I, \pi, \beta \rangle$
- $\pi$: pre-defined controller
- $\beta$: terminate when docked or charger not visible
- $I$: all states in which charger is in sight
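
A minimal sketch of an option as a data structure, loosely modeled on the docking example: an initiation predicate, a policy, and a termination test. The one-dimensional corridor, the visibility range, and every name in the code are hypothetical, and termination is treated as deterministic for simplicity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    can_start: Callable[[int], bool]     # initiation set I, as a predicate
    policy: Callable[[int], int]         # pi: state -> primitive action
    should_stop: Callable[[int], bool]   # beta, deterministic in this sketch

def run_option(option, state, env_step, max_steps=100):
    """Execute the option until it terminates; return (final state, steps taken)."""
    if not option.can_start(state):
        raise ValueError("state is outside the option's initiation set")
    steps = 0
    while steps < max_steps and not option.should_stop(state):
        state = env_step(state, option.policy(state))
        steps += 1
    return state, steps

# Hypothetical 1-D corridor: charger at position 0, visible from positions 0..5.
def charger_visible(s):
    return 0 <= s <= 5

dock = Option(
    can_start=charger_visible,
    policy=lambda s: -1,  # always drive toward the charger
    should_stop=lambda s: s == 0 or not charger_visible(s),
)
final_state, steps = run_option(dock, 3, env_step=lambda s, a: s + a)
```

From the caller's point of view the whole docking manoeuvre is a single temporally extended action that took `steps` primitive steps.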

16
**Sub-goal Options**

Task: gather all the red objects.
- Five options, one for each room
- Sub-goal options
  - Implicitly represent option policy
- Option MDPs related to each other

17
**Relativized Options**

Relativized options (Ravindran and Barto ’03):
- Spatial abstraction: MDP homomorphisms
- Temporal abstraction: options framework
- Abstract representation of a related family of sub-tasks
  - Each sub-task can be derived by applying well-defined transformations

18
**Relativized Options (Cont)**

[Figure labels: env, percept, top level, reduced state, option, action, actions.]

Relativized option $O = \langle h, M_O, I, \beta \rangle$:
- $h$: option homomorphism
- $M_O$: option MDP (image of $h$)
- $I$: initiation set
- $\beta$: termination criterion

19
**Rooms World Task**

- Single relativized option: get-object-exit-room
- Especially useful when learning option policy
  - Speed up
  - Knowledge transfer
- Terminology: Iba ’89
- Related to parameterized sub-tasks (Dietterich ’00; Andre and Russell ’01, ’02)

20
**Option Schema**

Finding the right transformation?
- Given a set of candidate transformations
- Option MDP and policy can be viewed as a policy schema (Schmidt ’75)
  - Template of a policy
  - Acquire schema in a prototypical setting
  - Learn bindings of sensory inputs and actions to schema

21
**Problem Formulation**

- Given:
  - the option MDP $M_O$, initiation set $I$, and termination criterion $\beta$ of a relativized option $O$
  - $H$, a family of transformations
- Identify the option homomorphism $h$
- Formulate as a parameter estimation problem
  - One parameter for each sub-task, taking values from $H$
  - Samples: observed transitions
  - Bayesian learning

22
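**Algorithm**

- Assume a uniform prior over the candidate transformations
- Experience: an observed transition
- Update posteriors: reweight each candidate transformation by how well it explains the transition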
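
The slide's formulas did not survive extraction, so the following is only a sketch of a standard Bayesian posterior update that fits the description: each candidate transformation is scored by the probability the option MDP assigns to the transformed transition, and the weights are renormalized. The function `bayes_update`, the dictionary `P_option`, and the two candidate transformations are assumptions made for illustration.

```python
def bayes_update(weights, transforms, P_option, s, a, s_next):
    """One posterior update of the weights over candidate transformations.

    Each transform is a pair (f, g): f maps a state, g maps (state, action)
    into the option MDP. P_option[(s, a, s_next)] is the option MDP's
    transition probability (missing keys mean probability 0).
    """
    unnormalized = [
        w * P_option.get((f(s), g(s, a), f(s_next)), 0.0)
        for w, (f, g) in zip(weights, transforms)
    ]
    total = sum(unnormalized)
    if total == 0.0:  # no candidate explains the sample; keep the old weights
        return list(weights)
    return [u / total for u in unnormalized]

# Hypothetical option MDP on positions 0..2: "fwd" advances with probability 0.9.
P_option = {(0, "fwd", 1): 0.9, (0, "fwd", 0): 0.1,
            (1, "fwd", 2): 0.9, (1, "fwd", 1): 0.1}

# Two hypothetical candidates: leave everything as is, or mirror the corridor.
identity = (lambda s: s, lambda s, a: a)
mirror = (lambda s: 2 - s, lambda s, a: {"fwd": "back", "back": "fwd"}[a])
transforms = [identity, mirror]

weights = [0.5, 0.5]  # uniform prior
weights = bayes_update(weights, transforms, P_option, s=2, a="back", s_next=1)
```

Here one observed transition is enough to rule out the identity candidate, because the option MDP assigns it zero probability under that transformation.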

23
**Complex Game World**

- Symmetric option MDP
- One delayer
- 40 transformations
  - 8 spatial transformations combined with 5 projections
- Parameters of option MDP different from the rooms

24
**Results: Speed of Convergence**

Learning the policy is more difficult than learning the correct transformation!

25
**Results: Transformation Weights in Room 4**

Transformation 12 eventually converges to 1

26
**Results: Transformation Weights in Room 2**

- Weights oscillate a lot
- Some transformation dominates eventually
  - Changes from one run to another

27
**Deictic Representation**

- Making attention more explicit
- Sense world via pointers: selective attention
- Actions defined with respect to pointers
  - Example: move block to top of block
- Agre ’88
  - Game domain: Pengo
- Pointers can be arbitrarily complex
  - ice-cube-next-to-me
  - robot-chasing-me

28
**Deixis and Abstraction**

- Deictic pointers project states and actions onto some abstract state-action space
- Consistent representation (Whitehead and Ballard ’91)
  - States with the same abstract representation have the same optimal value
  - Lion algorithm; works with deterministic systems
- Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew ’07)
  - Factored MDPs
  - Restrict the transformations available: only projections
  - Homomorphism conditions ensure a consistent representation

29
**Deictic Option Schema**

Deictic option schema:
- $O$: a relativized option
- $K$: a set of deictic pointers
- $D$: a collection of sets of possible projections, one for each pointer

Finding the correct transformation for a given state gives a consistent representation. Use a factored version of a parameter estimation algorithm.

30
**Classes of Pointers**

- Independent pointers
- Mutually dependent pointers

31
**Problem Formulation**

- Given: the option MDP, initiation set, and termination criterion of a relativized option
- Identify the right pointer configuration for each sub-task
- Formulate as a parameter estimation problem
  - One parameter for each set of connected pointers per sub-task
  - Takes values from the possible projections for those pointers
  - Samples: observed transitions
  - Heuristic modification of Bayesian learning

32
**Heuristic Update Rule**

Use a heuristic update rule, a modification of the Bayesian update.

33
**Game Domain**

- 2 deictic pointers: delayer and retriever
- 2 fixed pointers: where-am-I and have-diamond
- 8 possible values each for delayer and retriever
  - 64 transformations

34
**Experimental Setup**

- Composite agent
  - Uses 64 transformations and a single-component weight vector
- Deictic agent
  - Uses a 2-component weight vector
  - 8 transformations per component
- Hierarchical SMDP Q-learning
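
The difference between the two agents can be sketched as a difference in bookkeeping: the composite agent keeps one weight per joint pointer configuration (8 x 8 = 64), while the deictic agent keeps one 8-entry weight vector per pointer (16 weights in total). The code below is an illustrative assumption, not the authors' implementation; in particular, treating the joint weight as a product of component weights only makes sense if the two pointers are independent.

```python
from itertools import product

N_VALUES = 8
POINTERS = ("delayer", "retriever")

# Composite agent: a single weight vector over all joint configurations.
composite = {cfg: 1.0 / N_VALUES ** 2
             for cfg in product(range(N_VALUES), repeat=len(POINTERS))}

# Deictic (factored) agent: one weight vector per pointer.
factored = {p: [1.0 / N_VALUES] * N_VALUES for p in POINTERS}

def joint_weight(factored_weights, cfg):
    """Joint weight of a configuration, assuming independent pointers."""
    w = 1.0
    for pointer, value in zip(POINTERS, cfg):
        w *= factored_weights[pointer][value]
    return w

n_composite = len(composite)
n_factored = sum(len(v) for v in factored.values())
```

With uniform weights the two representations agree on every joint configuration; they differ in how many numbers must be updated after each sample.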

35
**Experimental Results – Speed of Convergence**

36
**Experimental Results – Timing**

Execution time:
- Composite: 14 hours and 29 minutes
- Factored: 4 hours and 16 minutes

37
**Experimental Results – Composite Weights**

Mean 2006 Std. Dev. 1673

38
**Experimental Results – Delayer Weights**

Mean 52 Std. Dev

39
**Experimental Results – Retriever Weights**

Mean 3045 Std. Dev

40
**Robosoccer**

- Hard to learn a policy for the entire game
- Look at simpler problems
  - Keepaway
  - Half-field offence
- Learn policy in a relative frame of reference
  - Keep changing the bindings

49
**Summary**

- Richer representations needed for RL
  - Deictic representations
- Deictic option schemas
  - Combine ideas of hierarchical RL and MDP homomorphisms
  - Capture aspects of deictic representation

50
**Future**

- Build an integrated cognitive architecture that uses both bottom-up and top-down information
  - In perception
  - In decision making
- Combine aspects of supervised learning, unsupervised learning, reinforcement learning, and planning approaches

51
**Symmetry example: Towers of Hanoi**

[Figure: start and goal configurations.]

A transformation that preserves the system properties is an automorphism. The group of all automorphisms is known as the symmetry group of the system.

52
**More Structure**

- In general, NP-hard
- Polynomial-time algorithm for computing the homomorphic image, under certain assumptions (Dean and Givan ’97; Lee and Yannakakis ’92; Ravindran ’04)
- State-dependent action recoding
  - Greater reduction in problem size
  - Model symmetries: reflections, rotations, permutations

53
**Symmetry**

- A symmetric system is one that is invariant under certain transformations onto itself.
- The gridworld in the earlier example is invariant under reflection along the diagonal.

[Figure: gridworld with action labels N, E, S, W before and after the reflection.]

54
**Rooms World Task**

- Train in room 1
- 8 candidate spatial transformations
  - Reflections about the x and y axes and the x = y and x = -y lines
  - Rotations by integer multiples of 90 degrees
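
The eight candidates listed above are the symmetries of a square. As a hedged sketch, they can be written as 2x2 integer matrices acting on room coordinates, assuming the coordinates are centered on the room so that the origin is a fixed point. The dictionary `TRANSFORMS` and the helper names are hypothetical, and only the state part of each transformation is shown; the matching action recoding is omitted.

```python
from itertools import product

# Each matrix ((a, b), (c, d)) maps (x, y) to (a*x + b*y, c*x + d*y).
TRANSFORMS = {
    "rotate_0": ((1, 0), (0, 1)),
    "rotate_90": ((0, -1), (1, 0)),
    "rotate_180": ((-1, 0), (0, -1)),
    "rotate_270": ((0, 1), (-1, 0)),
    "reflect_x_axis": ((1, 0), (0, -1)),
    "reflect_y_axis": ((-1, 0), (0, 1)),
    "reflect_x_eq_y": ((0, 1), (1, 0)),
    "reflect_x_eq_neg_y": ((0, -1), (-1, 0)),
}

def apply(matrix, point):
    """Apply a transformation to a centered (x, y) coordinate."""
    (a, b), (c, d) = matrix
    x, y = point
    return (a * x + b * y, c * x + d * y)

def compose(m1, m2):
    """Matrix of 'apply m2 first, then m1'."""
    (a, b), (c, d) = m1
    (e, f), (g, h) = m2
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))

# Composing any two candidates yields another candidate, so the set is closed.
closed = all(compose(m1, m2) in TRANSFORMS.values()
             for m1, m2 in product(TRANSFORMS.values(), repeat=2))
```

Closure matters in practice: it means the candidate set does not need to grow when transformations are chained.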

55
**Experimental Setup**

- Relativized agents
  - Knows the right transformations
  - Chooses from 8 transformations
- Hierarchical SMDP Q-learning (Dietterich ’00)
  - Q-learning at the lowest level (Watkins ’89)
  - SMDP Q-learning at the higher level (Bradtke and Duff ’95)
  - Simultaneous learning at all levels
  - Converges to recursively optimal policy
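
For readers unfamiliar with the higher-level learner, here is a minimal sketch of a tabular SMDP Q-learning update. It assumes the option ran for `tau` primitive steps and that `reward` is the discounted reward accumulated while it ran; the function name, the dictionary-based Q-table, and the step size and discount values are illustrative assumptions rather than the settings used in the experiments.

```python
def smdp_q_update(Q, s, option, reward, tau, s_next, options, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup after `option` ran for `tau` steps from s."""
    best_next = max(Q.get((s_next, o), 0.0) for o in options)
    # The bootstrap term is discounted by gamma**tau, not gamma.
    target = reward + (gamma ** tau) * best_next
    old = Q.get((s, option), 0.0)
    Q[(s, option)] = old + alpha * (target - old)
    return Q[(s, option)]

# Hypothetical usage with two named options.
Q = {(1, "exit_room"): 2.0}
new_value = smdp_q_update(
    Q, s=0, option="get_object", reward=1.0, tau=3, s_next=1,
    options=["get_object", "exit_room"], alpha=0.5, gamma=0.5,
)
```

With `tau = 1` this reduces to the ordinary one-step Q-learning update used at the lowest level.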

56
**Results: Speed of Convergence**

Not much of a difference since the correct transformation is identified in 15 iterations

57
**Approximate Equivalence**

- Exact homomorphisms are rare in practice
- Relax equivalence criteria (Ravindran and Barto ’02)
- Problem with Bayesian update
  - Use prototypical room as option MDP
  - Susceptible to incorrect samples

58
Example

59
**Heuristic Update Rule**

Use a heuristic update rule, a modification of the Bayesian update.
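
The rule itself is missing from the extracted slide. One plausible form, offered here purely as an assumption, floors each likelihood at a small constant `nu` before the usual reweighting, so that a single sample the prototypical option MDP considers impossible cannot drive a transformation's weight to zero permanently. The function name and the value of `nu` are hypothetical.

```python
def heuristic_update(weights, likelihoods, nu=0.01):
    """Bayes-like update with every likelihood floored at nu (assumed form)."""
    floored = [max(nu, p) * w for w, p in zip(weights, likelihoods)]
    total = sum(floored)
    return [v / total for v in floored]

w = [0.5, 0.5]
# A sample that the option MDP says is impossible under candidate 0.
w = heuristic_update(w, [0.0, 0.9])
# A later, misleading sample that favors candidate 0 instead.
w_after_noise = heuristic_update(w, [0.9, 0.0])
```

Under an exact Bayesian update the first sample would set candidate 0's weight to zero and no later evidence could revive it; with the floor, both weights stay positive and can recover.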

60
**Complex Game World**

- Gather all 4 diamonds in the world
- Benign and delayer “robots”
- Agent unaware of type of robots

61
**Experimental Setup**

- Regular agent
  - 4 separate options
- Relativized agent
  - Uses option MDP shown earlier
  - Chooses from 40 transformations
  - Room 2 has no right transformation
- Hierarchical SMDP Q-learning (Dietterich ’00)
  - Q-learning at the lowest level (Watkins ’89)
  - SMDP Q-learning at the higher level (Bradtke and Duff ’95)
