# NIPS 2007 Workshop: Hierarchical organization of behavior


Welcome! NIPS 2007 Workshop: Hierarchical organization of behavior

- Thank you for coming
- Apologies to the skiers…
- Why we will be strict about timing
- Why we want the workshop to be interactive

RL: Decision making

Goal: maximize reward (minimize punishment)

- Rewards/punishments may be delayed
- Outcomes may depend on sequence of actions

⇒ Credit assignment problem

RL in a nutshell: formalization

Components of an RL task: states - actions - transitions - rewards - policy - long term values

- Policy: π(S,a)
- State values: V(S)
- State-action values: Q(S,a)

[Figure: decision tree with states S1, S2, S3, actions L and R, and terminal rewards 4, 0, 2, 2]

RL in a nutshell: forward search

Model-based RL

- Model = T(ransitions) and R(ewards)
- learn model through experience (cognitive map)
- choosing actions is hard
- goal directed behavior; cortical

[Figure: the S1, S2, S3 tree unrolled for forward search over actions L and R, with outcomes = 4, = 0, = 2]
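Forward search over a learned model can be illustrated with a minimal sketch. The tree layout below (S1 goes to S2 on L and to S3 on R, with terminal rewards 4, 0, 2, 2 and zero reward at S1) is an assumption read off the figure labels and the Q-values quoted in the slides; the names `T`, `R`, `q_forward`, and `best_action` are hypothetical.

```python
# Assumed model of the three-state example (not confirmed by the slides).
T = {  # transitions: (state, action) -> next state; None means the episode ends
    ("S1", "L"): "S2", ("S1", "R"): "S3",
    ("S2", "L"): None, ("S2", "R"): None,
    ("S3", "L"): None, ("S3", "R"): None,
}
R = {  # rewards: (state, action) -> immediate reward
    ("S1", "L"): 0, ("S1", "R"): 0,
    ("S2", "L"): 4, ("S2", "R"): 0,
    ("S3", "L"): 2, ("S3", "R"): 2,
}
ACTIONS = ("L", "R")


def q_forward(state, action):
    """Value of taking `action` in `state`, found by searching the model."""
    nxt = T[(state, action)]
    # Look ahead: the best value reachable from the next state (0 at the end).
    future = 0 if nxt is None else max(q_forward(nxt, a) for a in ACTIONS)
    return R[(state, action)] + future


def best_action(state):
    """Choose the action whose searched value is highest."""
    return max(ACTIONS, key=lambda a: q_forward(state, a))
```

Every decision re-runs the search through the whole tree, which is one way to read "choosing actions is hard": nothing is cached, so the cost is paid at choice time.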

RL in a nutshell: cached values

Trick #1: Long-term values are recursive

$$Q(S,a) = r(S,a) + V(S_{\text{next}})$$

Model-free RL: temporal difference learning

$$Q(S,a) = r(S,a) + \max_{a'} Q(S',a')$$

TD learning: start with initial (wrong) Q(S,a)

$$PE = r(S,a) + \max_{a'} Q(S',a') - Q(S,a)$$

$$Q(S,a)_{\text{new}} = Q(S,a)_{\text{old}} + \eta \, PE$$

[Figure: decision tree with states S1, S2, S3, actions L and R, and terminal rewards 4, 0, 2, 2]
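A worked instance of the TD update may help. It writes the learning rate (the factor multiplying PE) as η and assumes η = 0.5 and all Q-values initialized to 0; neither number is given in the slides. Suppose the agent is in S2, takes L, receives r = 4, and the episode ends, so the max term is 0:

$$PE = 4 + 0 - 0 = 4, \qquad Q(S_2,L)_{\text{new}} = 0 + 0.5 \cdot 4 = 2$$

Repeating the same experience once more:

$$PE = 4 + 0 - 2 = 2, \qquad Q(S_2,L)_{\text{new}} = 2 + 0.5 \cdot 2 = 3$$

Each repetition halves the remaining error, so the cached value approaches 4 without ever consulting a model of the task.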

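RL in a nutshell: cached values

Trick #2: Can learn values without a model

Model-free RL: temporal difference learning

- choosing actions is easy (but need lots of practice to learn)
- habitual behavior; basal ganglia

| State-action | Cached value |
|--------------|--------------|
| Q(S1,L)      | 4            |
| Q(S1,R)      | 2            |
| Q(S2,L)      | 4            |
| Q(S2,R)      | 0            |
| Q(S3,L)      | 2            |
| Q(S3,R)      | 2            |

[Figure: decision tree with states S1, S2, S3, actions L and R, and terminal rewards 4, 0, 2, 2]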
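The cached values listed above can be reproduced by a small model-free learner. This is a sketch under stated assumptions, not the presenters' implementation: the tree layout is inferred from the quoted Q-values, and the learning rate, episode count, and purely random action choice are hypothetical. The learner only samples from `T` and `R` as an environment and never plans with them.

```python
import random

# Assumed environment (same inferred layout as the slides' three-state tree).
T = {("S1", "L"): "S2", ("S1", "R"): "S3",
     ("S2", "L"): None, ("S2", "R"): None,
     ("S3", "L"): None, ("S3", "R"): None}
R = {("S1", "L"): 0, ("S1", "R"): 0,
     ("S2", "L"): 4, ("S2", "R"): 0,
     ("S3", "L"): 2, ("S3", "R"): 2}
ACTIONS = ("L", "R")


def q_learning(episodes=2000, eta=0.1, seed=0):
    """Learn Q(S,a) from sampled experience using the TD prediction error."""
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in T}            # start with initial (wrong) values
    for _ in range(episodes):
        state = "S1"
        while state is not None:
            action = rng.choice(ACTIONS)  # random exploration (an assumption)
            nxt = T[(state, action)]
            future = 0.0 if nxt is None else max(Q[(nxt, a)] for a in ACTIONS)
            pe = R[(state, action)] + future - Q[(state, action)]
            Q[(state, action)] += eta * pe
            state = nxt
    return Q


Q = q_learning()
```

Once learned, choosing an action is a table lookup, which is the "easy" part; the many episodes needed to fill the table are the "lots of practice".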

RL in real world tasks…

model based vs. model free learning and control

[Figure: the forward-search tree (model based) next to the cached values (model free): Q(S1,L) = 4, Q(S1,R) = 2, Q(S2,L) = 4, Q(S2,R) = 0, Q(S3,L) = 2, Q(S3,R) = 2]

Scaling problem!
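One way to make the scaling problem concrete is to count, assuming a deterministic tree in which each of |A| actions leads to a distinct state and decisions continue for d steps (an idealization, not a claim from the slides):

$$N_{\text{sequences}} = |A|^{d}, \qquad N_{Q} = |A| \cdot \frac{|A|^{d} - 1}{|A| - 1}$$

For the example with |A| = 2 and d = 2, this gives 4 action sequences to search and 6 cached values to learn, matching the six Q entries. At d = 20 the same two-action tree has 1,048,576 sequences and 2,097,150 Q-values, so both forward search and caching grow exponentially with depth.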

Hierarchical RL: What is it?

Real-world behavior is hierarchical

1. set water temp
2. get wet
3. shampoo
4. soap
5. turn off water
6. dry off

[Figure: flowchart expanding step 1, "set water temp": too cold → add hot; too hot → add cold; change → wait 5sec; just right → success]

1. pour coffee
2. add sugar
3. add milk
4. stir

simplified control, disambiguation, encapsulation
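The encapsulation idea in the shower example can be sketched as a subroutine called from a top-level sequence. The temperatures, tolerance, and step size are hypothetical numbers chosen for illustration; the slide only names the branches (too cold, too hot, wait, just right).

```python
def set_water_temp(temp, target=38.0, tolerance=1.0, step=2.0, max_steps=50):
    """Subroutine: adjust the tap until the water is 'just right'.

    Returns the final temperature and the primitive actions that were taken.
    """
    actions = []
    for _ in range(max_steps):
        if abs(temp - target) <= tolerance:  # just right -> success
            actions.append("success")
            return temp, actions
        if temp < target:                     # too cold
            actions.append("add hot")
            temp += step
        else:                                 # too hot
            actions.append("add cold")
            temp -= step
        actions.append("wait 5sec")           # wait after each change
    raise RuntimeError("water temperature did not settle")


def shower(start_temp):
    """Top-level routine: six steps, the first of which is itself a subroutine."""
    _, sub_actions = set_water_temp(start_temp)
    return [("set water temp", sub_actions), "get wet", "shampoo",
            "soap", "turn off water", "dry off"]
```

The top-level routine never needs to know how many adjustments the subroutine made, which is the sense in which the hierarchy simplifies control.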

Hierarchical RL: What is it?

HRL: (in)formal framework

options - skills - macros - temporally abstract actions (Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)

- Termination condition = (sub)goal state
- Option policy learning: via pseudo reward (model based or model free)

Option: set water temperature

- initiation set: S1, S2, S8, …
- policy: [table of action probabilities per state; surviving entries: S1: 0.8, 0.1; S2: 0.8; S3: 0, 1, 0]
- termination conditions: S1 (0.1), S2 (0.1), S3 (0.9), …
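The three parts of an option (initiation set, policy, termination conditions) map naturally onto a small data structure. In this sketch the initiation set and termination probabilities follow the slide, while the action names and the policy probabilities are hypothetical placeholders, since the policy table did not survive intact. The class name `Option` and its methods are likewise illustrative, not taken from any library.

```python
import random
from dataclasses import dataclass


@dataclass
class Option:
    """A temporally abstract action: where it may start, how it acts, when it stops."""
    name: str
    initiation_set: set   # states in which the option may be selected
    policy: dict          # state -> {action: probability}
    termination: dict     # state -> probability of terminating in that state

    def can_start(self, state):
        return state in self.initiation_set

    def act(self, state, rng):
        """Sample a primitive action from the option's own policy."""
        actions, probs = zip(*self.policy[state].items())
        return rng.choices(actions, weights=probs, k=1)[0]

    def stops(self, state, rng):
        """Sample termination; unlisted states always terminate (an assumption)."""
        return rng.random() < self.termination.get(state, 1.0)


set_water_temperature = Option(
    name="set water temperature",
    initiation_set={"S1", "S2", "S8"},          # from the slide
    policy={                                     # hypothetical values
        "S1": {"add hot": 0.8, "add cold": 0.1, "wait": 0.1},
        "S2": {"add hot": 0.1, "add cold": 0.1, "wait": 0.8},
        "S3": {"add hot": 0.0, "add cold": 1.0, "wait": 0.0},
    },
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},  # from the slide
)
```

A higher-level controller would treat `set_water_temperature` as a single choice: check `can_start`, then call `act` repeatedly until `stops` returns true.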

Hierarchical RL: What is it?

HRL: a toy example

- S: start
- G: goal
- Options: going to doors
- Actions: primitive actions + 2 door options

Hierarchical RL: What is it?

Advantages of HRL

1. Faster learning (mitigates scaling problem)
2. Transfer of knowledge from previous tasks (generalization, shaping)

RL: no longer 'tabula rasa'

Hierarchical RL: What is it?

Disadvantages (or: the cost) of HRL

1. Need 'right' options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure

no free lunches…
