From Reflex to Reason Rich Sutton AT&T Labs with thanks to Satinder Singh, Doina Precup, and Andy Barto.


1 From Reflex to Reason Rich Sutton AT&T Labs with thanks to Satinder Singh, Doina Precup, and Andy Barto

2 Overall Goal: A computational understanding of a broad span of the mind’s activities: what it computes and why it computes it. At a high level, without the specifics of sensory and motor systems, specific representations and algorithms, neural implementations, or language. What does the mind do? Is there an overall, simple answer? (Marr’s 3 levels)

3 Main Claims: Mind is about predictions –making predictions –discovering what predictions can be made. Knowledge is predictions –action-contingent and temporally-flexible predictions –agent-centric, grounded in experience from the bottom up. The mind’s ultimate goal is to make reward-maximizing decisions –but most of its effort is devoted to the subgoal of prediction. A few simple mechanisms enable working flexibly with predictions –TD learning and Bellman backups. (Prediction Semantics)

4 A prediction is a signal with meaning. Knowing that one signal is a prediction of another enables it to do useful work for you. When something new predicts X, you know what to do. Prediction semantics constrains in two directions. [Diagram: X → Response Y (existing link); Pred. of X → Response Y (new link)]

5 Outline/Steps: Reflexes and their conditioning; Learning to get reward; Planning, by mental simulation; Knowledge, as temporally flexible predictions; Reason, as flexible use of knowledge. These together are much of what the mind does. Can we explain them all in a uniform way?

6 Pavlovian Conditioning, the Conditioning of Reflexes. [Diagram: CS = tone (a neutral stimulus); US = eye shock; UR = eyeblink before learning; CR = eyeblink after learning (no US)] Almost any reflex can be conditioned: salivation; orienting; heart rate, blood pressure; gill withdrawal; nausea, taste aversion; fear, secondary reinforcers; CER: freezing, suppression. The animal can be viewed as learning that the CS predicts the US, and then responding in anticipation. But why? Why should a prediction of the US produce the same response as the US?

7 (Inadequate) Comp. Theories of CC Instrumental theories -- the CR makes the US feel better –Works well for eyeblink, salivation, not for 2ndary reinforcers –Does not explain the similarity of CR and UR –Does not explain apparent conflict of CC and instrumental Anticipation theories -- whatever you are going to do, CC causes you to do it earlier –Why earlier? Earlier is not always better! –How much earlier? CR tends to occur at time of US Prediction theories -- CC is learning to predict the US –Works for fear, CER, 2ndary reinforcers –Does not explain response to CR or to UR –Explains “What” but not “Why”

8 Pred Rep’n Theory of Conditioning: The reflex is NOT US → Response [diagrams: the reflex link runs from the US to R, with a learnable link from the CS], but Prediction of US → Response [diagram: the reflex link runs from Pred of US to R, with learnable links into the prediction from both the CS and the US]. USs could habituate!

9 Pred Rep’n Theory of Conditioning (2): Consider an innate, learnable association US → Response –represents an innate guess, e.g., that a shock now is a good predictor of a shock coming up –but could be wrong –predicts URs could habituate, change over time depending on their relationship to themselves. Long USs predict themselves. Short USs are poor self-predictors. [Diagram labels: US (supervisory), US (cue), CS, prediction of supervisory US, Response]

10 Pred Rep’n Theory of Conditioning (3): Implications for response topography/generation –predicts maximal CR at time of US onset (correct) –predicts CR onset only so early as to enable this –predicts threshold phenomena in CR production –predicts interaction of threshold with relative effectiveness of reinforced and unreinforced trials. [Figure: CR response topography relative to the US]

11 Outline/Steps: Reflexes and their conditioning; Learning to get reward; Planning, by mental simulation; Knowledge, as temporally flexible predictions; Reason, as flexible use of knowledge.

12 The Reward Hypothesis: That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment. Is this reasonable? Is it demeaning? Is there no other choice? It seems to be adequate and perhaps completely satisfactory.

13 Reinforcement Learning Theory: What to Compute and Why. Policies: $\pi : \text{States} \to \Pr(\text{Actions})$. Value Functions: $V^{\pi}(s) = E\left\{ \sum_{t=1}^{\infty} \gamma^{t-1}\, \text{reward}_t \;\middle|\; \text{start in } s_0 = s, \text{ follow } \pi \right\}$. 1-Step Models. Predictions!
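
To make the value function concrete, here is a minimal sketch of the discounted sum of rewards it averages over. The three-step episode, the reward of -1 per step, and the discount factor of 0.9 are illustrative assumptions, not values from the talk. With a deterministic world and policy, the expectation reduces to a single return.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * reward_t for t = 1, 2, ... over a finite episode."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total


# Hypothetical episode: the policy reaches the goal in 3 steps from state s0,
# receiving reward -1 on each step (a minimum-time-to-goal setup).
rewards_from_s0 = [-1.0, -1.0, -1.0]

# With deterministic dynamics and policy, V(s0) equals this one return:
# -1 - 0.9 - 0.81 = -2.71
value_s0 = discounted_return(rewards_from_s0, gamma=0.9)
```

In a stochastic world the same function would be averaged over many sampled episodes to estimate the expectation.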

14 Honeybee Brain & VUM Neuron (Hammer, Menzel)

15 The Acrobot Problem (e.g., DeJong & Spong, 1994; Sutton, 1995). Goal: Raise tip above line. Minimum-Time-to-Goal: Reward = -1 per time step. 4 state variables: 2 joint angles, 2 angular velocities. CMAC of 48 layers. RL same as Mountain Car. [Figure labels: fixed base, torque applied here, tip, goal line]

16 Prediction Semantics of RL: An action that predicts reward in a state... should to that extent be favored in that state. [Diagram: representations of state and action → (learned links) → value, a pred. of reward → (fixed link) → action selection: pick the highest-valued action]

17 Examples of Reinforcement Learning: RoboCup Soccer Teams (Stone & Veloso, Riedmiller et al.) –World’s best player of simulated soccer, 1999; Runner-up 2000. Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) –10-15% improvement over industry standard methods. Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin) –World's best assigner of radio channels to mobile telephone calls. Elevator Control (Crites & Barto) –(Probably) world's best down-peak elevator controller. Many Robots –navigation, bi-pedal walking, grasping, switching between skills... TD-Gammon and Jellyfish (Tesauro, Dahl) –World's best backgammon player.

18 TD-Gammon (Tesauro): Start with a random network. Play millions of games against itself. Learn a value function from this simulated experience. This produces arguably the best player in the world. Action selection by 2-3 ply search. [Diagram: the network outputs the value; TD error = $V_{t+1} - V_t$]

19 Prediction Semantics in TD-Gammon: A prediction of winning can substitute for winning –the central idea of Temporal-Difference (TD) learning: learning a prediction from a prediction! –also key idea of dynamic programming –and all heuristic search. In lookahead search, predictions are composed to produce longer-term predictions –key to all state-space planning –suggests prediction semantics is a key element of reasoning.
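
The idea of "learning a prediction from a prediction" can be sketched with tabular TD(0) on a toy problem. This is not TD-Gammon (which used a neural network and backgammon positions); it assumes a hypothetical 5-state random walk in which exiting on the right counts as a win (outcome 1) and exiting on the left as a loss (outcome 0). The step size, episode count, and seed are also assumptions.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.005, seed=0):
    """Tabular TD(0) estimate of the probability of winning from each state."""
    rng = random.Random(seed)
    n = 5
    V = [0.5] * n  # initial predictions for the non-terminal states 0..4
    for _ in range(episodes):
        s = n // 2  # every episode starts in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                target = 0.0  # fell off the left end: actual outcome, a loss
            elif s_next >= n:
                target = 1.0  # fell off the right end: actual outcome, a win
            else:
                target = V[s_next]  # a prediction substitutes for the outcome
            # Move the current prediction toward the target by the TD error.
            V[s] += alpha * (target - V[s])
            if s_next < 0 or s_next >= n:
                break
            s = s_next
    return V


V = td0_random_walk()
```

For this walk the true win probabilities are 1/6, 2/6, ..., 5/6, so the learned values should land near them, even though most updates never see a final outcome directly.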

20 Outline/Steps: Reflexes and their conditioning; Learning to get reward; Planning, by mental simulation; Knowledge, as temporally flexible predictions; Reason, as flexible use of knowledge.

21 Planning as RL over Mental Simulation: 1. Learn a model of the world’s transition dynamics: transition probabilities and expected immediate rewards (a “1-step model” of the world). 2. Use the model to generate imaginary experiences: internal thought trials, mental simulation (Craik, 1943). 3. Apply RL as if the experience had really happened. I.e., learning on model-generated experience. [Diagram labels: Reward, Value Function, 1-Step Model, Policy]

22 Dyna Algorithm
1. s ← current state
2. Choose an action, a, and take it
3. Receive next state, s’, and reward, r
4. Apply RL backup to s, a, s’, r (e.g., Q-learning update)
5. Update Model(s, a) with s’, r
6. Repeat k times:
   - select a previously seen state-action pair s, a
   - s’, r ← Model(s, a)
   - apply RL backup to s, a, s’, r
7. Go to 1
[Diagram: value/policy, experience, and model, connected by acting, model learning, direct RL, and planning]
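
The Dyna steps can be sketched in a few lines using a Q-learning backup. This is a minimal sketch under stated assumptions: the 6-state deterministic corridor, the reward of 1 at the goal, epsilon-greedy action choice, and all parameter values are hypothetical and chosen only to keep the example small.

```python
import random


def dyna_q(n_states=6, episodes=30, k=10, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    """Dyna with Q-learning backups on a hypothetical corridor (goal at the right end)."""
    rng = random.Random(seed)
    goal = n_states - 1
    actions = (-1, +1)  # step left, step right
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # learned 1-step model: (s, a) -> (s', r)

    def world_step(s, a):
        s2 = min(max(s + a, 0), goal)
        return s2, (1.0 if s2 == goal else 0.0)

    def backup(s, a, s2, r):
        # Q-learning update; the goal is terminal, so its value is 0.
        best_next = 0.0 if s2 == goal else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = 0                                    # 1. current state
        while s != goal:
            if rng.random() < epsilon:           # 2. choose an action
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s2, r = world_step(s, a)             # 3. next state and reward
            backup(s, a, s2, r)                  # 4. direct RL
            model[(s, a)] = (s2, r)              # 5. model learning
            for _ in range(k):                   # 6. planning: k simulated backups
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                backup(ps, pa, ps2, pr)
            s = s2                               # 7. go to 1
    return Q


Q = dyna_q()
```

The only difference between step 4 and step 6 is where the experience comes from: the world in one case, the learned model in the other. The same backup is applied to both.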

23 State-Space Search is based on a Prediction Semantics: in seeking to evaluate this state, we use predictions from these. (Labels on a search diagram)

24 Prediction Semantics in Planning is just like in TD-Gammon: Predictions substitute for path outcomes. Predictions are composed to predict consequences of arbitrary sequences of action.

25 Naïve RL Theory of Reason: Reason is RL on model-generated experience. Pro: –Very simple, uniform, general –Sufficient to reproduce, e.g., latent learning. Con: –Seems too low-level –Represents only a limited kind of knowledge. [Diagram labels: Reward, Value Function, 1-Step Model, Policy]

26 Outline/Steps: Reflexes and their conditioning; Learning to get reward; Planning, by mental simulation; Knowledge, as temporally flexible predictions; Reason, as flexible use of knowledge.

27 Experience: A mind interacts with its world to produce two time series. Actions: $\ldots, a_{t-3}, a_{t-2}, a_{t-1}, a_t, ?$ Observations: $\ldots, o_{t-3}, o_{t-2}, o_{t-1}, o_t, ??$ Experience is the data; it is all we really know. Experience provides something for knowledge to be about. [Diagram: Agent sends actions to World; World returns observations]

28 The world is a black box, known only by its I/O behavior (observations in response to actions). Therefore, all meaningful statements about the world are statements about the observations it generates. The only observations worth talking about are future ones. Therefore: the only meaningful things to say about the world are predictions. World knowledge is predictions, where predictions = statements about the joint distribution of future observations and actions.

29 Non-predictive “Knowledge”: Mathematical knowledge, theorems and proofs –always true, but tell us nothing about the world –not world knowledge. Uninterpreted signals, e.g., useful representations –real and useful, but not by themselves world knowledge, only an aid to acquiring it. Knowledge of the past. Policies –could be viewed as predictions of value –but by themselves are more like uninterpreted signals. Predictions capture “regular”, descriptive world knowledge.

30 Every Prediction must be Grounded in Two Directions. Example: “if I do action 1, then obs 12 will be 0 for three steps”. [Diagram: history of actions & observations → (recognition grounding, “symbol grounding”) → prediction → (prediction grounding, “prediction semantics”)]

31 Both Recognition and Prediction Grounding are Needed: “Classical” AI systems omit recognition grounding –e.g., “Tweety is a bird”, “John loves Mary” –sometimes called the “symbol grounding problem”. Modern AI systems tend to skimp on prediction grounding –supervised learning, Bayes nets, robotics… It is not OK to leave prediction grounding to external, human observers –the information is just not in the machine –we don’t understand it; we haven’t done our job! Yet this is such an appealing shortcut that we have almost always done it.

32 Prediction Semantics formalized as Macro-Actions (Sutton, Precup & Singh, AIJ 1999, et al.): Let $\pi : \text{States} \to \Pr(\text{Actions})$ be an arbitrary policy. Let $\beta : \text{States} \to \Pr(\{0,1\})$ be a termination condition. Then the macro-action $\langle \pi, \beta \rangle$ is a kind of experiment – do $\pi$ until $\beta$ says “stop” – measure something about the resulting experience. Suppose we measure – the state at the end of the experiment – the total reward during the experiment. Then the macro prediction for $\langle \pi, \beta \rangle$ would predict Pr(end-state), E{total reward} given start-state. Predictions of this form can represent a lot... possibly all world knowledge.
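
A macro prediction of this kind can be estimated by simply running the experiment many times. The sketch below is a Monte Carlo illustration, not the learning method from the cited paper. It assumes a hypothetical 5-state corridor with a "hallway" at each end, a policy that always tries to move right, a termination condition that stops at either hallway, actions that go the wrong way 1/3 of the time, and a reward of -1 per step.

```python
import random
from collections import Counter

LEFT_HALL, RIGHT_HALL = 0, 4  # hypothetical corridor ends


def step(s, a, rng):
    """Primitive action a in {-1, +1}; it goes the opposite way 1/3 of the time."""
    move = a if rng.random() < 2 / 3 else -a
    s_next = min(max(s + move, LEFT_HALL), RIGHT_HALL)
    return s_next, -1.0  # reward of -1 per time step (an assumption)


def policy(s):
    """The macro-action's policy: always try to move right."""
    return +1


def beta(s):
    """Termination condition: probability of stopping in state s."""
    return 1.0 if s in (LEFT_HALL, RIGHT_HALL) else 0.0


def run_option(start, rng):
    """One experiment: follow the policy until beta says stop."""
    s, total = start, 0.0
    while True:
        s, r = step(s, policy(s), rng)
        total += r
        if rng.random() < beta(s):
            return s, total


def estimate_option_model(start, n_rollouts=20000, seed=0):
    """Estimate Pr(end-state) and E{total reward} given the start state."""
    rng = random.Random(seed)
    ends, reward_sum = Counter(), 0.0
    for _ in range(n_rollouts):
        end, total = run_option(start, rng)
        ends[end] += 1
        reward_sum += total
    probs = {s: c / n_rollouts for s, c in ends.items()}
    return probs, reward_sum / n_rollouts


probs, mean_reward = estimate_option_model(start=2)
```

For start state 2 the estimates should approach Pr(end-state = 4) = 0.8 and E{total reward} = -3.6, which follow from the standard gambler's-ruin formulas for this biased walk.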

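33 Rooms Example (Sutton, Precup, & Singh): 4 stochastic primitive actions (up, down, left, right), which fail 33% of the time. Multi-step macro-actions (to each room's 2 hallways). [Figure: rooms gridworld with hallways and goal locations G1 and G2; inset shows the policy of one macro-action, with hallway targets o1 and o2]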

34 Planning with Macro-Predictions [Figure: planning with macro-actions]

35 Learning Path-to-Goal with and without Hallway Macro-Actions

36 Illustration: Reconnaissance Mission Planning (Problem). Mission: Fly over (observe) most valuable sites and return to base. Stochastic weather affects observability (cloudy or clear) of sites. Limited fuel. Intractable with classical optimal control methods. Temporal scales: –Actions: which direction to fly now –Options: which site to head for. Options compress space and time: –Reduce steps from ~600 to ~6 –Reduce states from ~10^11 to ~10^6. [Figure labels: any state (10^6), sites only (6)]

37 Illustration: Reconnaissance Mission Planning (Results). SMDP planner: –Assumes options followed to completion –Plans optimal SMDP solution. SMDP planner with re-evaluation: –Plans as if options must be followed to completion –But actually takes them for only one step –Re-picks a new option on every step. Static planner: –Assumes weather will not change –Plans optimal tour among clear sites –Re-plans whenever weather changes. [Figure: expected reward per mission under low fuel and high fuel for the static re-planner, the SMDP planner, and the SMDP planner with re-evaluation of options on each step] Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner.

38 Outline/Steps: Reflexes and their conditioning; Learning to get reward; Planning, by mental simulation; Knowledge, as temporally flexible predictions; Reason, as flexible use of knowledge.

39 Reason: Combining knowledge to obtain new knowledge, flexibly and generally. We must be able to reason about any event as a possible (sub)goal, not just about rewards. This is the final step.

40 Subgoals: Many natural macro-actions are goal-oriented –E.g., drive-to-work, open-the-door. So replicate planning in miniature for each subgoal. Macros can then be learned to achieve each subgoal. Many can be learned at once, independently –Solves classic problem of subgoal credit assignment –Solves psychological puzzle of goal-oriented action. Models of such macros are goal-oriented recognizers –correspond to classical “concepts” –e.g., a “chair” state is one where sitting is predicted to work. (See the rooms example.)

41 Rooms Example: Independent learning of all 8 subgoals. All 8 hallway macros and predictions are learned accurately and efficiently while actions are selected totally at random. [Figure: learning curves; horizontal axis ticks up to 100,000]
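
Learning several subgoals at once from random behavior can be sketched with off-policy Q-learning, using one value table per subgoal updated from the same stream of experience. This is a toy stand-in for the rooms experiment, not a reproduction of it: the 7-state corridor, the two end-state subgoals, the subgoal reward of 1 on arrival, and the parameter values are all assumptions.

```python
import random


def learn_subgoals(n_states=7, steps=20000, alpha=0.5, gamma=0.9, seed=0):
    """Learn one Q-table per subgoal while acting totally at random."""
    rng = random.Random(seed)
    actions = (-1, +1)
    subgoals = (0, n_states - 1)  # hypothetical subgoals: the two corridor ends
    Q = {g: {(s, a): 0.0 for s in range(n_states) for a in actions}
         for g in subgoals}
    s = n_states // 2
    for _ in range(steps):
        a = rng.choice(actions)  # behavior ignores all learned values
        s2 = min(max(s + a, 0), n_states - 1)
        # Every subgoal learns from the same transition, independently.
        for g in subgoals:
            if s == g:
                continue  # already at this subgoal; nothing to learn here
            if s2 == g:
                target = 1.0  # subgoal reached: terminal for this subgoal
            else:
                target = gamma * max(Q[g][(s2, b)] for b in actions)
            Q[g][(s, a)] += alpha * (target - Q[g][(s, a)])
        s = s2
    return Q


def greedy(Qg, s, actions=(-1, +1)):
    """Greedy action for one subgoal's Q-table in state s."""
    return max(actions, key=lambda a: Qg[(s, a)])


Q = learn_subgoals()
```

After learning, the greedy policy for the right-end subgoal moves right from every other state and the one for the left-end subgoal moves left, although the behavior that generated the data never favored either.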

42 Co-Existence of Hedonism and Exploration/Constructivism: The ultimate goal is still reward. Still one primary policy and set of values. But many other policies, values, and predictions are learned not directly in service of reward. Most time is spent in exploration and discovery, gaining knowledge rather than reward: –What possibilities does the world afford? –How can I control and predict it in a variety of ways? –What concepts can be learned that might help later? From hedonism to curiosity and constructivism.

43 Main Claims: Mind is about predictions –making predictions –discovering what predictions can be made. Knowledge is predictions –action-contingent and temporally-flexible predictions –agent-centric, grounded in experience from the bottom up. The mind’s ultimate goal is to make reward-maximizing decisions –but most of its effort is devoted to the subgoal of prediction. A few simple mechanisms enable working flexibly with predictions –TD learning and Bellman backups. (Prediction Semantics)

44 What is New? The formalization of macro-actions –provides temporal abstraction –as well as action contingency (experiments) –meshes seamlessly with learning and planning methods. Using the goal-oriented machinery of RL –for knowledge construction –for perceptual concepts. Taking the discipline of predictive knowledge seriously –speaking only in terms of the subjective, experiential data.

45 Should Knowledge be Experiential? Allowing only Predictions in terms of Data? Loses: Expressiveness –can’t talk about objects, space, people; no “is-a” or “part-of”. External (human) coherence –verbal labels, interpretability, explainability, calibration –the “shortcut” of entering knowledge directly into the agent. Gains: The knowledge will have meaning to the machine. It can be mechanically learned/verified/extended. It will be suited for a general reasoning process –composition and backup of predictions to yield new predictions.

