
1 Intrinsically Motivated Hierarchical Reinforcement Learning
Andrew Barto
Autonomous Learning Lab, Department of Computer Science, University of Massachusetts Amherst

2 Overview
A. Barto, Hierarchical Organization of Behavior, NIPS 2007 Workshop
• A few general thoughts on RL
• Motivation: extrinsic and intrinsic
• Intrinsically motivated RL
• Where do hierarchies of options come from?
• A (now old) example
• Difficulties…

3 RL = Search + Memory
• Search: trial-and-error, generate-and-test, variation-and-selection
• Memory: remember what worked best for each situation and start from there next time

4 Variety + Selection
• Trial-and-error, or generate-and-test
• Learn a mapping from states or situations to actions: a policy
[Diagram: a Generator (Actor) maps states or situations to actions; a Tester (Critic) returns evaluations]

5 Smart Generators and Testers
• Smart Tester: provides high-quality and timely evaluations
• Value functions, Temporal Difference (TD) learning, “adaptive critics”, Q-learning
• Smart Generator: should increase the level of abstraction of the search space with experience
At its root, my story of intrinsic motivation and hierarchical RL is about making smart adaptive generators.

6 Smart Generators are Hierarchical
[Diagram of “building blocks”: patterns of musical notes → chords → chord sequences, etc.]

7 The Usual View of RL
[Diagram: the standard agent–environment loop; label: “Primary Critic”]

8 The Less Misleading View
[Diagram: the agent–environment loop redrawn; label: “Primary Critic”]

9 Motivation
• “Forces” that energize an organism to act and that direct its activity
• Extrinsic motivation: being moved to do something because of some external reward ($$, a prize, etc.)
• Intrinsic motivation: being moved to do something because it is inherently enjoyable
• Curiosity, exploration, manipulation, play, learning itself...

10 Intrinsic Motivation
H. Harlow, “Learning and Satiation of Response in Intrinsically Motivated Complex Puzzle Performance”, Journal of Comparative and Physiological Psychology, Vol. 43, 1950

11 Harlow 1950
“A manipulation drive, strong and extremely persistent, is postulated to account for learning and maintenance of puzzle performance. It is further postulated that drives of this class represent a form of motivation which may be as primary and as important as the homeostatic drives.”

12 Robert White’s famous 1959 paper
“Motivation Reconsidered: The Concept of Competence”, Psychological Review, Vol. 66, pp. 297–333, 1959
A critique of Hullian and Freudian drive theories, which held that all behavior is motivated by biologically primal needs (food, drink, sex, escape, …), either directly or through secondary reinforcement.
• Competence: an organism’s capacity to interact effectively with its environment
• Cumulative learning: significantly devoted to developing competence

13 D. E. Berlyne
“As knowledge accumulated about the conditions that govern exploratory behavior and about how quickly it appears after birth, it seemed less and less likely that this behavior could be a derivative of hunger, thirst, sexual appetite, pain, fear of pain, and the like, or that stimuli sought through exploration are welcomed because they have previously accompanied satisfaction of these drives.”
D. E. Berlyne, “Curiosity and Exploration”, Science, 1966

14 What is Intrinsically Rewarding?
• novelty
• surprise
• salience
• incongruity
• manipulation
• “being a cause”
• mastery: being in control
• curiosity
• exploration
• …
D. E. Berlyne’s writings are a rich source of data and suggestions.

15 An Example of Intrinsically Motivated RL
Rich Sutton, “Integrated Architectures for Learning, Planning and Reacting based on Approximating Dynamic Programming”, Machine Learning: Proceedings of the Seventh International Workshop, 1990
• For each state and action, add a value, called the exploration bonus, to the usual immediate reward.
• The bonus is a function of the time since that action was last executed in that state: the longer the time, the greater the assumed uncertainty, and the greater the bonus.
• This facilitates learning of an environment model.
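The bonus scheme can be sketched in a few lines. This is a hedged illustration, not a transcription of Sutton's code: the slide only says the bonus grows with the time since the state-action pair was last tried, so the square-root form and the coefficient `kappa` used here are assumptions (they match the later Dyna-Q+ presentation of the same idea).

```python
import math

def bonus_augmented_reward(reward, steps_since_tried, kappa=0.001):
    """Add an exploration bonus to the immediate reward.

    The bonus grows with the number of steps since this action was last
    executed in this state: longer untried => more assumed uncertainty
    => larger bonus. kappa scales the bonus (illustrative value).
    """
    return reward + kappa * math.sqrt(steps_since_tried)
```

A just-tried pair gets no bonus, while long-neglected pairs look increasingly attractive, which keeps the agent's environment model fresh.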

16 Computational Curiosity
Jürgen Schmidhuber, IDSIA, Lugano: “A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers”, From Animals to Animats, 1991; “Adaptive Confidence and Adaptive Curiosity”, technical report, 1991; “What’s Interesting?”, technical report, 1997
• “The direct goal of curiosity and boredom is to improve a world model.....”
• Curiosity unit: reward is a function of the mismatch (Euclidean distance) between the model’s current predictions and actuality. There is positive reinforcement whenever the system fails to correctly predict the environment.
• Better: use cumulative prediction-error changes.

17 Computational Curiosity, cont.
• Schmidhuber, cont.
• Problem with the earlier idea (expectation mismatch as reward): the agent will focus on parts of the environment that are inherently unpredictable, so it will be rewarded even though the model cannot improve, and it won’t try to learn easier parts before hard parts.
• Instead of prediction error as reward, use cumulative prediction-error changes.
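The shift from raw prediction error to error *change* can be sketched as follows; the function names and the clipping at zero are illustrative assumptions, not Schmidhuber's exact formulation:

```python
def raw_error_reward(curr_error):
    """Earlier idea: reward = current prediction error.
    White noise keeps the error (and hence the reward) high forever."""
    return curr_error

def learning_progress_reward(prev_error, curr_error):
    """Later idea: reward = reduction in prediction error.
    Only parts of the world where the model is actually improving pay
    out; unpredictable noise yields no progress, hence no reward."""
    return max(0.0, prev_error - curr_error)
```

On a learnable signal the error shrinks and progress reward flows early, then dries up once the model has converged; on noise the error never shrinks, so the progress reward stays at zero even though the raw-error reward would stay high.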

18 Computational Curiosity, cont.
• More Schmidhuber:
• “The important point is: The same complex mechanism which is used for ‘normal’ goal-directed learning is used for implementing curiosity and boredom. There is no need for devising a separate system....”
• “My adaptive explorer continually wants … to focus on those novel things that seem easy to learn, given current knowledge. It wants to ignore (1) previously learned, predictable things, (2) inherently unpredictable ones (such as details of white noise on the screen), and (3) things that are unexpected but not expected to be easily learned (such as the contents of an advanced math textbook beyond the explorer’s current level).”

19 Computational Curiosity, cont.
Frédéric Kaplan & Pierre-Yves Oudeyer, Sony CSL Paris, e.g. “Intelligent adaptive curiosity: a source of self-development”, Proceedings of the 4th International Workshop on Epigenetic Robotics, 2004
• Objective: efficiently learn an environment model
• Reward: a measure of learning progress based on prediction error

20 Other Examples of IMRL
• Jürgen Schmidhuber: computational curiosity
• Frédéric Kaplan & Pierre-Yves Oudeyer: learning progress
• Xiao Huang & John Weng
• Jim Marshall, Doug Blank, & Lisa Meeden
• Rob Saunders
• Mary Lou Maher
• others...

21 What are the Features of IR?
• IR depends only on internal state components
• These components track aspects of the agent’s history
• IR can depend on the current Q, V, π, etc.
• IR is task independent (where the task is defined by extrinsic reward) (?)
• IR is transient: e.g., based on prediction error
• …
• Most have the goal of efficiently building a “world model”

22 Where do Options Come From?
• Many can be hand-crafted from the start (and should be!)
• How can an agent create useful options for itself?

23 Lots of Approaches
• visit frequency and reward gradient [Digney 1998]
• visit frequency on successful trajectories [McGovern & Barto 2001]
• variable change frequency [Hengst 2002]
• relative novelty [Şimşek & Barto 2004]
• salience [Singh et al. 2004]
• clustering algorithms and value gradients [Mannor et al. 2004]
• local graph partitioning [Şimşek et al. 2005]
• causal decomposition [Jonsson & Barto 2005]
• exploiting commonalities in collections of policies [Thrun & Schwartz 1995; Bernstein 1999; Perkins & Precup 1999; Pickett & Barto 2002]
Many of these involve identifying subgoals.

24 Creating Task-Independent Subgoals
• Our approach: learn a collection of reusable skills
• Subgoals = intrinsically rewarding events

25 A Not-Quite-So-Simple Example: Playroom (Singh, Barto, & Chentanez 2005)
The agent has an eye, a hand, and a visual marker.
Actions:
• move eye to hand
• move eye to marker
• move eye to random object
• move hand to eye
• move hand to marker
• move marker to eye
• move marker to hand
• if both eye and hand are on an object: turn on light, push ball, etc.

26 Playroom, cont.
• The switch controls the room lights
• The bell rings and moves one square if the ball hits it
• Pressing the blue/red block turns the music on/off
• The lights have to be on to see colors
• Blocks can be pushed
• The monkey laughs if the bell and music both sound in a dark room

27 Skills
To make the monkey laugh (with primitive actions):
• Move eye to switch
• Move hand to eye
• Turn lights on
• Move eye to blue block
• Move hand to eye
• Turn music on
• Move eye to switch
• Move hand to eye
• Turn lights off
• Move eye to bell
• Move marker to eye
• Move eye to ball
• Move hand to ball
• Kick ball to make bell ring
Using skills (options):
• Turn lights on
• Turn music on
• Turn lights off
• Ring bell

28 Playroom
[Figure: playroom layout; panels “Laugh Off (Monkey)” and “Laugh On (Monkey)”]

29 Option Creation and Intrinsic Reward
• Subgoals: events that are “intrinsically interesting”; here, unexpected changes in lights and sounds
• On an event’s first occurrence, create an option with that event as subgoal
• Intrinsic reward is generated whenever the subgoal is achieved:
• proportional to the error in the prediction of that event (“surprise”), so it decreases with experience
• Use a standard RL algorithm with R = IR + ER
• Previously learned options are available as actions for learning the policies of new options (primitive actions always remain available too)
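A minimal sketch of this scheme, assuming a single learned occurrence probability stands in for the event model (the actual system learned richer option models; the class name, `lr`, `scale`, and the 1 − p̂ surprise measure are illustrative assumptions):

```python
class SalientEventOption:
    """Created the first time a salient event (e.g. lights turning on)
    occurs. Intrinsic reward on later occurrences is proportional to the
    remaining prediction error ("surprise"), so it fades with experience."""

    def __init__(self, event, scale=1.0, lr=0.2):
        self.event = event
        self.p_hat = 0.0   # learned prediction that the event occurs
        self.scale = scale
        self.lr = lr

    def intrinsic_reward(self):
        r = self.scale * (1.0 - self.p_hat)        # surprise
        self.p_hat += self.lr * (1.0 - self.p_hat)  # model improves
        return r

def combined_reward(extrinsic, option):
    """R = IR + ER, as on the slide."""
    return extrinsic + option.intrinsic_reward()
```

The first occurrence is maximally surprising; as the event model sharpens, the intrinsic reward decays and the agent's attention naturally moves on to not-yet-predicted events.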

30 Identified Options Help
Learning to make the monkey laugh, given an option for each of these subgoals, but no policies:
• lights on
• lights off
• music on
• music off
• ring bell

31 Reward for Salient Events
[Figure: intrinsic reward over time for the salient events: lights, music, sound (bell), monkey]

32 Speed of Learning Various Skills
[Figure]

33 Learning to Make the Monkey Laugh
[Figure]

34 Shortcomings
• Hand-crafted for our purposes
• Pre-defined subgoals (based on “salience”)
• Completely observable
• Little state abstraction
• Not very stochastic
• No un-caused salient events
• “Obsessive” behavior toward subgoals
• Tries to use bad options
• More…

35 Access States (Özgür Şimşek)
• States that allow the agent to transition to a part of the state space that is otherwise unavailable or difficult to reach from its current region
• Doorways, airports, elevators
• Completion of a subtask
• Building a new tool
• cf. Hengst 2002; McGovern & Barto 2001; Menache, Mannor & Shimkin 2002; Jonsson & Barto 2006

36 Connectivity of Playroom States (Özgür Şimşek)
[Figure: playroom state-space graph, with clusters labeled by combinations of light on/off, music on/off, and noise on/off]

37 Relative Novelty Algorithm (Şimşek & Barto 2004)
Access states will frequently introduce short-term novelty in the early stages of learning.
• Novelty: defined for a state sequence S
• Relative novelty of a state transition t = (novelty of the state sequence of length lag that followed t) / (novelty of the state sequence of length lag that preceded t)
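The ratio can be sketched as follows, assuming a per-state novelty score of 1/√(visit count); the slide does not fix the novelty measure, so that choice, the averaging, and the function names are illustrative:

```python
import math

def novelty(states, visits):
    """Novelty of a state sequence: here the mean of 1/sqrt(visit count)
    over its states (an assumed, but common, novelty score)."""
    return sum(1.0 / math.sqrt(visits[s]) for s in states) / len(states)

def relative_novelty(trajectory, t, lag, visits):
    """Novelty of the `lag` states after transition t divided by the
    novelty of the `lag` states before it. Spikes when a well-explored
    region gives way to a rarely visited one, i.e. at access states."""
    before = trajectory[t - lag:t]
    after = trajectory[t:t + lag]
    return novelty(after, visits) / novelty(before, visits)
```

A transition from heavily visited states into rarely visited ones scores well above 1, flagging it as a candidate access state (e.g. a doorway).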

38 Access States via Relative Novelty
[Figure]

39 Obsession with Subgoals
• If the agent’s task is to get as much reward as possible, then what motivates it to build a widely-defined policy to be exploited later?
• As a continuing task with a usual RL algorithm?
• Episodic, e.g. with random restarts?
• Optimistic initial values?
• Counter-based?
• But what we want is a kind of on-line prioritized sweeping

40 Explore Now / Exploit Later
• Our approach: design an intrinsic reward mechanism to create a policy (the behavior policy) that is efficient for learning a policy that is optimal for a different reward function (the task policy)
• Two value functions are maintained:
• one to solve the Task MDP (in the future): V_T
• another used to control behavior now: V_B
• V_B predicts intrinsic reward, which we have to define to make this work

41 Intrinsically Motivated Exploration (Şimşek & Barto 2006)
• The basic idea: use changes in the task value function as intrinsic reward (for updating the behavior value function)
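A tabular sketch of the two-value-function scheme, assuming Q-learning for both and the absolute size of the task-value update as the intrinsic reward; the precise definition of "change" and all function names here are illustrative, not Şimşek & Barto's exact algorithm:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup; returns |change| in Q[s, a]."""
    old = Q[(s, a)]
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] = old + alpha * (target - old)
    return abs(Q[(s, a)] - old)

def intrinsically_motivated_step(Q_task, Q_beh, s, a, r_ext, s2, actions):
    """Update the task value function with the extrinsic reward, then
    feed the size of that update to the behavior value function as
    intrinsic reward: behavior is drawn to where task values change."""
    delta = q_update(Q_task, s, a, r_ext, s2, actions)
    q_update(Q_beh, s, a, delta, s2, actions)
    return delta
```

Behavior is then greedy with respect to Q_beh, which steers experience toward wherever the task value function is still being revised, a form of on-line prioritized sweeping.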

42 A Little Illustration
• Counter-based exploration: at each state the agent takes the action it has tried the least number of times (Thrun)
• Stochastic grid with success probability 0.9
[Figure]
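The counter-based rule used for comparison can be sketched in a few lines; the tie-breaking by action order is an assumption made for concreteness:

```python
from collections import defaultdict

def counter_based_action(counts, state, actions):
    """Pick the action tried fewest times in this state (ties broken by
    list order), then record the try. Counts, not values, drive
    exploration, so every action gets systematically revisited."""
    a = min(actions, key=lambda act: counts[(state, act)])
    counts[(state, a)] += 1
    return a
```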

43 A Test
• Actions: North, South, East, West; stochastic with success probability 0.9
• Reward: –0.001 for each action, plus a terminal reward
• Q-learning, α = 0.1, γ = 0.99

44 Performance
[Figure; legend: R = Random, CB = Counter-Based, CP = Constant-Penalty]

45 Connects with Previous RL Work
• Schmidhuber
• Kaplan and Oudeyer
• Thrun and Moller
• Sutton
• Marshall, Blank, & Meeden
• Duff
• Others…
But these did not have the option framework and related algorithms available.

46 Using Bad Options
[Figure]

47 Discovering Causal Structure (Chris Vigorito)

48 A Robot Example: “Dexter”
UMass Laboratory for Perceptual Robotics; Rod Grupen, director; Rob Platt, Shichao Ou, Steve Hart, John Sweeney

49 Conclusions
• Need for smart adaptive generators
• Adaptive generators grow hierarchically
• Intrinsic motivation is important for creating behavioral building blocks
• RL + options + intrinsic reward is a natural way to do this
• Development!
• Theory? Behavior? Neuroscience?

