
2 Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning Laboratory Department of Computer Science University of Massachusetts Amherst Barto@cs.umass.edu Searching in the Right Space

3 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Psychology Artificial Intelligence (machine learning) Control Theory and Operations Research Artificial Neural Networks Computational Reinforcement Learning (RL) Neuroscience Computational Reinforcement Learning “Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” —Dayan 2001

4 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Plan  High-level intro to RL  Part I: The personal odyssey  Part II: The modern view  Part III: Intrinsically Motivated RL

5 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The View from Machine Learning  Unsupervised Learning recode data based on some given principle  Supervised Learning “Learning from examples”, “Learning with a teacher”, related to Classical (or Pavlovian) Conditioning  Reinforcement Learning “Learning with a critic”, related to Instrumental (or Thorndikian) Conditioning

6 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Classical Conditioning Tone (CS: Conditioned Stimulus) Food (US: Unconditioned Stimulus) Salivation (UR: Unconditioned Response) Anticipatory salivation (CR: Conditioned Response) Pavlov, 1927

7 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Edward L. Thorndike (1874-1949) puzzle box Learning by “Trial-and-Error”

8 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Trial-and-Error = Error Correction Artificial Neural Network: learns from a set of examples via error-correction

9 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Supervised Learning [diagram: a Supervised Learning System maps Inputs to Outputs; Training Info = desired (target) outputs; Error = (target output – actual output)]

10 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Reinforcement Learning [diagram: an RL System maps Inputs to Outputs (“actions”); Training Info = evaluations (“scores”, “rewards”, “penalties”)] Objective: get as much reward as possible!

11 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Least-Mean-Square” (LMS) Learning Rule (“delta rule”, Adaline; Widrow and Hoff, 1960) [diagram: input pattern x_1, …, x_n with weights w_1, …, w_n; actual output V = Σ_i w_i x_i compared with desired output z]. Adjust the weights by Δw_i = α (z − V) x_i.
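
A minimal sketch of this rule in Python (the learning-rate value and variable names are illustrative, not from the slides):

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS ("delta rule") step: w_i += alpha * (z - V) * x_i,
    where V = w . x is the actual output and z is the desired output."""
    V = np.dot(w, x)            # actual output
    return w + alpha * (z - V) * x

# usage: learn to output z = 1.0 for a fixed input pattern
w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
for _ in range(50):
    w = lms_update(w, x, z=1.0)
```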

12 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Trial-and-Error?  “The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The `error’ signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment.” Widrow and Hoff, “Adaptive Switching Circuits” 1960 IRE WESCON Convention Record

13 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 MENACE (Michie 1961): “Matchbox Educable Noughts and Crosses Engine” [figure: an array of noughts-and-crosses board positions, one per matchbox]

14 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Essence of RL (for me at least!): Search + Memory  Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection,...  Memory: remember what worked best for each situation and start from there next time RL is about caching search results (so you don’t have to keep searching!)

15 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Generate-and-Test  Generator should be smart: Generate lots of things that are likely to be good based on prior knowledge and prior experience But also take chances …  Tester should be smart too: Evaluate based on real criteria, not convenient surrogates But be able to recognize partial success

16 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Plan  High-level intro to RL  Part I: The personal odyssey  Part II: The modern view  Part III: Intrinsically Motivated RL

17 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Key Players  Harry Klopf  Rich Sutton  Me

18 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Arbib, Kilmer, and Spinelli in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett, 1974 “Neural Models and Memory”

19 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A. Harry Klopf “Brain Function and Adaptive Systems -- A Heterostatic Theory” Air Force Cambridge Research Laboratories Technical Report 3 March 1972 “…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist.”

20 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Klopf’s theory (very briefly!)  Inspiration: The nervous system is a society of self-interested agents. Nervous Systems = Social Systems Neuron = Man Man = Hedonist Neuron = Hedonist Depolarization = Pleasure Hyperpolarization = Pain  A neuronal model: A neuron “decides” when to fire based on comparing a spatial and temporal summation of weighted inputs with a threshold. A neuron is in a condition of heterostasis from time t to t + Δt if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval. Two ways to adapt weights to do this: Push excitatory weights to upper limits; zero out inhibitory weights Make neuron control its input.

21 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Heterostatic Adaptation  When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances.  The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.  The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.  Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).

22 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Key Components of Klopf’s Theory  Eligibility  Closed-loop control by neurons  Extremization (e.g., maximization) as goal instead of zeroing something  “Generalized Reinforcement”: reinforcement is not delivered by a specialized channel The Hedonistic Neuron A Theory of Memory, Learning, and Intelligence A. Harry Klopf Hemisphere Publishing Corporation 1982

23 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Eligibility Traces (Klopf, 1972) [figure: eligibility as a function of time since synaptic activity, peaking at an optimal ISI]. The same curve as the reinforcement-effectiveness curve in conditioning: maximal at about 400 ms and zero after approximately 4 s; interpreted as a histogram of the lengths of the feedback pathways in which the neuron is embedded.

24 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Later Simplified Eligibility Traces [figure: visits to state s over TIME, with the corresponding accumulating trace and replacing trace]
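
A small sketch of the two trace types for a tabular state representation (the decay constants and names are illustrative):

```python
import numpy as np

def update_traces(e, visited_state, gamma=0.9, lam=0.8, replacing=False):
    """Decay every trace, then bump the trace of the state just visited:
    accumulating traces add 1, replacing traces reset to 1."""
    e = gamma * lam * e
    if replacing:
        e[visited_state] = 1.0
    else:
        e[visited_state] += 1.0
    return e

e = np.zeros(5)
for s in [2, 2, 2, 4]:        # repeated visits keep boosting an accumulating trace
    e = update_traces(e, s)
```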

25 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Rich Sutton  BA Psychology, Stanford, 1978  As an undergrad, discovered Klopf’s 1972 tech report  Two unpublished undergraduate reports: “Learning Theory Support for a Single Channel Theory of the Brain” 1978 “A Unified Theory of Expectation in Classical and Instrumental Conditioning” 1978 (?)  Rich’s first paper: “Single Channel Theory: A Neuronal Theory of Learning” Brain Theory Newsletter, 1978.

26 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Sutton’s Theory  A_j(t): level of activation of mode j at time t  V_ij(t): sign and magnitude of association from mode i to mode j at time t  E_ij(t): eligibility of V_ij for undergoing changes at time t; proportional to the average of the product A_i(t)A_j(t) over some small past time interval (or an average of the logical AND)  P_j(t): expected level of activation of mode j at time t (a prediction of the level of activation of mode j)  c_ij: a constant depending on the particular association being changed

27 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 What exactly is P_j?  Based on recent activation of the mode: The higher the activation within the last few seconds, the higher the level expected for the present...  P_j(t) is proportional to the average of the activation level over some small time interval (a few seconds or less) before t.

28 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Sutton’s theory  Contingent Principle: based on reinforcement a neuron receives after firings and the synapses which were involved in the firings, the neuron modifies its synapses so that they will cause it to fire when the firing causes an increase in the neuron’s expected reinforcement after the firing. Basis of Instrumental, or Thorndikian, conditioning  Predictive Principle: if a synapse’s activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity will come to have an effect on the neuron similar to that of reinforcement. Basis of Classical, or Pavlovian, conditioning

29 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Sutton’s Theory  Main addition to Klopf’s theory: addition of the difference term — a temporal difference term  Showed relationship to the Rescorla-Wagner model (1972) of Classical Conditioning Blocking Overshadowing  Sutton’s model was a real-time model of both classical and instrumental conditioning  Emphasized conditioned reinforcement

30 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Rescorla–Wagner Model, 1972: ΔV_A = α β (λ − ΣV), the change in associative strength of CS A, where α is a parameter related to CS intensity, β is a parameter related to US intensity, λ is the asymptote set by the US, and ΣV is the sum of associative strengths of all CSs present (“composite expectation”). “Organisms only learn when events violate their expectations.” A “trial-level” model.
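
A trial-level sketch of this update in Python; because the prediction error is shared across all present CSs, blocking falls out when one CS is pretrained (parameter values and names are illustrative):

```python
import numpy as np

def rescorla_wagner_trial(V, present, lam, alpha=0.3, beta=1.0):
    """V: associative strengths of all CSs; present: 0/1 vector of CSs on this
    trial; lam: asymptote set by the US (0 if the US is absent)."""
    composite = np.dot(V, present)            # "composite expectation"
    dV = alpha * beta * (lam - composite)     # shared prediction error
    return V + dV * present                   # only present CSs change

V = np.zeros(2)                               # CS A and CS B
for _ in range(100):                          # Phase 1: A -> US
    V = rescorla_wagner_trial(V, np.array([1.0, 0.0]), lam=1.0)
for _ in range(100):                          # Phase 2: A+B -> US (B is blocked)
    V = rescorla_wagner_trial(V, np.array([1.0, 1.0]), lam=1.0)
print(V)                                      # V[1] stays near zero
```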

31 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Conditioned Reinforcement. Phase I: Tone → Food. Phase II: Light → Tone, with no Food; i.e., pair the light with a predictor of food but not with food itself.

32 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Conditioned Reinforcement  Stimuli associated with reinforcement take on reinforcing properties themselves  Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978)  “In principle this chaining can go back for any length …” (Sutton, 1978)  Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning

33 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Where was I coming from?  Studied at the University of Michigan: at the time a hotbed of genetic algorithm activity due to John Holland’s influence (PhD in 1975)  Holland talked a lot about the exploration/exploitation tradeoff  But I studied dynamic system theory, relationship between state-space and input/output representations of systems, convolution and harmonic analysis, finally cellular automata  Fascinated by how simple local rules can generate complex global behavior: Dynamic systems Cellular automata Self-organization Neural networks Evolution Learning

34 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Sutton and Barto, 1981  “Toward a Modern Theory of Adaptive Networks: Expectation and Prediction” Psych Review 88, 1981  Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of x instead of xy.  Emphasized anticipatory nature of the CR  Related to “Adaptive System Theory”: Other neural models (Hebb, Widrow & Hoff’s LMS, Uttley’s “Informon”, Anderson’s associative memory networks) Pointed out relationship between Rescorla-Wagner model and Adaline, or LMS algorithm Studied algorithm stability Reviewed possible neural mechanisms: e.g., eligibility = intracellular Ca ion concentrations

35 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “SB Model” of Classical Conditioning

36 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Temporal Primacy Overrides Blocking in SB model Kehoe, Schreurs, and Graham 1987 our simulation

37 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Intratrial Time Courses (part 2 of blocking)

38 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Adaline Learning Rule [diagram: input pattern and target output; the LMS rule, Widrow and Hoff, 1960]

39 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Rescorla–Wagner Unit” [diagram: CS inputs, weighted by a vector of “associative strengths”, form the “composite expectation” leading to the CR; the US input leads to the UR]

40 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Important Notes  The “target output” of LMS corresponds to the US input to Rescorla-Wagner model  In both cases, this input is specialized in that it does not directly activate the unit but only directs learning  The SB model is different, with the US input activating the unit and directing learning  Hence, SB model can do secondary reinforcement  SB model stayed with Klopf’s idea of “generalized reinforcement”

41 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 One-Step-Ahead LMS Predictor [diagram: input pattern x_1 … x_n, prediction y, the quantity predicted z, and the error z − y used to adjust the weights]

42 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 One Neural Implementation of S-B Model

43 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A Major Problem: US offset  e.g., if a CS has the same time course as the US, the weights would change so that the US is cancelled out [figure: US, CS, and final result]. Why? Because the rule is trying to zero out y_t − y_{t−1}.

44 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Associative Memory Networks Kohonen et al. 1976, 1977; Anderson et al. 1977

45 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Associative Search Network Barto, Sutton, & Brouwer 1981

46 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Associative Search Network Barto, Sutton, Brouwer, 1981 Problem of context transitions: add a predictor “one-step-ahead LMS predictor”

47 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Relation to Klopf/Sutton Theory Did not include generalized reinforcement since z(t) is a specialized reward input Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan

48 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Associative Search Network

49 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Landmark Learning” Barto & Sutton 1981 An illustration of associative search

50 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Landmark Learning”

51 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Landmark Learning” swap E and W landmarks

52 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Note: Diffuse Reward Signal [diagram: units with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3, all receiving the same broadcast reward] Units can learn different things despite receiving identical inputs...

53 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Provided there is variability  ASN just used noisy units to introduce variability  Variability drives the search  Needs to have an element of “blindness”, as in “blind variation”: i.e. outcome is not completely known beforehand  BUT does not have to be random  IMPORTANT POINT: Blind Variation does not have to be random, or dumb

54 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Pole Balancing Widrow & Smith, 1964 “Pattern Recognizing Control Systems” Michie & Chambers, 1968 “Boxes: An Experiment in Adaptive Control” Barto, Sutton, & Anderson 1984

55 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 MENACE (Michie 1961): “Matchbox Educable Noughts and Crosses Engine” [figure: an array of noughts-and-crosses board positions, one per matchbox]

56 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The “Boxes” Idea  “Although the construction of this Matchbox Educable Noughts and Crosses Engine (Michie 1961, 1963) was undertaken as a ‘fun project’, there was present a more serious intention to demonstrate the principle that it may be easier to learn to play many easy games than one difficult one. Consequently it may be advantageous to decompose a game into a number of mutually independent sub-games even if much relevant information is put out of reach in the process.” Michie and Chambers, “Boxes: An Experiment in Adaptive Control” Machine Intelligence 2, 1968
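
A sketch of the Boxes idea in code: one tiny independent learner per state “box”, each remembering only its own local statistics (all names and the exploration noise are illustrative):

```python
import numpy as np
from collections import defaultdict

class Box:
    """One independent learner for a single region ("box") of state space."""
    def __init__(self, n_actions=2):
        self.value = np.zeros(n_actions)   # running score per action
        self.count = np.zeros(n_actions)

    def choose(self):
        noise = 1e-3 * np.random.randn(len(self.value))   # break ties / explore
        return int(np.argmax(self.value + noise))

    def update(self, action, outcome):
        self.count[action] += 1
        self.value[action] += (outcome - self.value[action]) / self.count[action]

boxes = defaultdict(Box)                   # one box per discretized state
a = boxes["pole_tilting_left"].choose()
boxes["pole_tilting_left"].update(a, outcome=1.0)
```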

57 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Boxes

58 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Actor-Critic Architecture ACE = adaptive critic element ASE = associative search element

59 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Actor ASE: associative search element Note: 1) Move from changes in evaluation to just r 2) Move from y to just y in eligibility.

60 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Critic ACE: adaptive critic element Note differences with SB model: 1) Reward has been pulled out of the weighted sum 2) Discount factor γ: decay rate of predictions if not sustained by external reinforcement

61 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Putting them Together [diagram: taking action y in state s leads to s′, and the reward prediction changes from p(s) to p(s′); a higher reward prediction makes taking action y in state s more likely, a lower one makes it less likely]

62 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Actor & Critic learning almost identical [diagram: the Actor and the Adaptive Critic both update their weights from the TD error δ, formed from the primary reward r and the prediction p (plus noise for the actor’s action): Δw ∝ δ · e; for the critic, e = trace of presynaptic activity only; for the actor, e = trace of pre- and postsynaptic correlation]
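
A rough tabular sketch of that shared-TD-error structure (an illustrative reconstruction of the idea, not the original ASE/ACE code; step sizes and names are mine):

```python
import numpy as np

def actor_critic_step(p, prefs, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.95):
    """p: state-value predictions (critic); prefs: action preferences (actor).
    The same TD error trains both the critic's prediction for s and the
    actor's tendency to repeat action a in s."""
    delta = r + gamma * p[s_next] - p[s]   # TD error: change in reward prediction
    p[s] += alpha * delta                  # critic update
    prefs[s, a] += beta * delta            # actor update
    return delta

p = np.zeros(10)
prefs = np.zeros((10, 2))
actor_critic_step(p, prefs, s=3, a=1, r=0.0, s_next=4)
```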

63 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Credit Assignment Problem” Spatial Temporal Getting useful training information to the right places at the right times Marvin Minsky, 1961

64 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Associative Reward-Penalty Element” (A_{R-P}) Barto & Anandan 1985 (same as ASE)

65 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A_{R-P}  If λ = 0, “Associative Reward-Inaction Element” A_{R-I}  Think of r(t)y(t) as the desired response  Stochastic version of Widrow et al.’s “Selective Bootstrap Element” [Widrow, Gupta, & Maitra, “Punish/Reward: Learning with a Critic in Adaptive Threshold Systems”, 1973]  Associative generalization of L_{R-P}, a “stochastic learning automaton” algorithm (with roots in Tsetlin’s work and in mathematical psychology, e.g., Bush & Mosteller, 1955) This is where we got the term “Critic”

66 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A_{R-P} Convergence Theorem  Input patterns linearly independent  Each input has nonzero probability of being presented on a trial  NOISE has a cumulative distribution that is strictly monotonically increasing (excludes uniform dist. and the deterministic case)  The learning rate has to decrease as usual…  For all stochastic reward contingencies, as λ approaches 0, the probability of each correct action approaches 1.  BUT, it does not work when λ = 0.

67 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Contingency Space: 2 actions (two-armed bandit) Explore/Exploit Dilemma

68 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Interesting follow-up to the A_{R-P} theorem  Williams’ REINFORCE class of algorithms (1987) generalizes A_{R-I} (i.e., λ = 0).  He showed that the weights change according to an unbiased estimate of the gradient of the reward function  BUT NOTE: this is the case for which our theorem isn’t true!  Recent “policy gradient” methods generalize REINFORCE algorithms
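
For concreteness, a sketch of a REINFORCE-style update for a single Bernoulli-logistic unit, the reward-inaction (λ = 0) case discussed above; the task, step size, and names are illustrative:

```python
import numpy as np

def reinforce_unit_step(w, x, r_fn, alpha=0.05):
    """One trial: sample a binary action from a logistic unit, get a reward in
    [0, 1], and step along r * (y - p) * x, an unbiased estimate of the
    gradient of expected reward."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))    # probability that y = 1
    y = 1.0 if np.random.rand() < p else 0.0   # stochastic ("noisy") action
    r = r_fn(x, y)                             # problem-specific reward
    return w + alpha * r * (y - p) * x

# usage: reward the unit for copying its input bit (with a constant bias input)
w = np.zeros(2)
for _ in range(5000):
    x = np.array([float(np.random.randint(2)), 1.0])
    w = reinforce_unit_step(w, x, lambda x, y: 1.0 if y == x[0] else 0.0)
```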

69 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Learning by Statistical Cooperation Barto 1985 Feedforward networks of A_{R-P} units Most reward achieved when the network implements the identity map (each unit has an unshown constant input)

70 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Identity Network Results (λ = 0.04)

71 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 XOR Network Most reward achieved when the network implements XOR

72 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 XOR Network Behavior (λ = 0.08) [figure panels: visible element, hidden element]

73 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Notes on A_{R-P} Nets  None of these networks work with λ = 0: they almost always converge to a local maximum.  Elements face non-stationary reward contingencies; they have to converge for all contingencies, even hard ones.  Rumelhart, Hinton, & Williams published the backprop paper shortly after this (in 1986).  A_{R-P} networks and backprop networks do pretty much the same thing, BUT backprop is much faster.  Barto & Jordan, “Gradient Following without Backpropagation in Layered Networks,” First IEEE Conference on Neural Networks, 1987.

74 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 On the speeds of various layered network algorithms  Backprop: slow  Boltzmann Machine: glacial  Reinforcement Learning: don’t ask! My recollection of a talk by Geoffrey Hinton c. 1988

75 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 “Credit Assignment Problem” Spatial Temporal Getting useful training information to the right places at the right times Marvin Minsky, 1961

76 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Teams of Learning Automata  Tsetlin, M. L. Automata Theory and Modeling of Biological Systems, Academic Press NY, 1973  e.g. the “Goore Game”  Real games were studied too…

77 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Neurons and Bacteria  Koshland’s (1980) model of bacterial tumbling  Barto (1989) “From Chemotaxis to Cooperativity: Abstract Exercises in Neuronal Learning Strategies” in The Computing Neuron, Durbin, Miall, & Mitchison (eds.), Addison-Wesley, Wokingham, England

78 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 TD Model of Pavlovian Conditioning  The adaptive critic (slightly modified) as a model of Pavlovian conditioning (Sutton & Barto 1990): a “floor”, and the US instead of r

79 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 TD Model Predictions of what? “imminence weighted sum of future USs” i.e. discounting

80 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 TD Model “Complete Serial Compound”... “tapped delay line”

81 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Summary of Part I: Eligibility; Neurons as closed-loop controllers; Generalized reinforcement; Prediction; Real-time conditioning models; Conditioned reinforcement; Adaptive system/machine learning theory; Stochastic search; Associative Reinforcement Learning; Teams of self-interested units

82 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Key Computational Issues  Trial-and-error Error-Correction  Essence of RL (for me): search + memory  Variability is essential  Variability needs to be somewhat blind but not dumb  Smart generator; smart tester  The “Boxes Idea”: break up large search into many small searches  Prediction is important  What to predict: total future reward  Changes in prediction are useful local evaluations  Credit assignment problems

83 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Plan  High-level intro to RL  Part I: The personal odyssey  Part II: The modern view  Part III: Intrinsically Motivated RL

84 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Part II: The Modern View  Shift from animal learning to sequential decision problems: stochastic optimal control  Markov Decision Processes (MDPs)  Dynamic Programming (DP)  RL as approximate DP  Give up the neural models…

85 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Samuel’s Checkers Player 1959 [diagram: the CURRENT BOARD is fed to an EVALUATION FUNCTION (Value Function) V, which outputs a score such as +20]

86 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Arthur L. Samuel “... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.” Some Studies in Machine Learning Using the Game of Checkers, 1959

87 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 TD-Gammon (Tesauro, 1992–1995) Start with a random network Play very many games against itself Learn a value function from this simulated experience This produces (arguably) the best player in the world Value = estimated prob. of winning STATES: configurations of the playing board (about 10^20) ACTIONS: moves REWARDS: win: +1 lose: 0

88 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Sequential Decision Problems  Decisions are made in stages.  The outcome of each decision is not fully predictable but can be observed before the next decision is made.  The objective is to maximize a numerical measure of total reward over the entire sequence of stages: called the return  Decisions cannot be viewed in isolation: need to balance desire for immediate reward with possibility of high reward in the future.

89 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Agent–Environment Interface: … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …

90 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Markov Decision Processes  If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).  If state and action sets are finite, it is a finite MDP.  To define a finite MDP, you need to give: state and action sets; one-step “dynamics” defined by transition probabilities P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }; and reward expectations R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }
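
One minimal way to hold these one-step dynamics for a finite MDP (the sizes, names, and toy reward are arbitrary placeholders):

```python
import numpy as np

n_states, n_actions = 4, 2
# P[a, s, s2] = probability of moving from s to s2 under action a
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)
# R[a, s, s2] = expected immediate reward for that transition
R = np.zeros((n_actions, n_states, n_states))
R[:, :, n_states - 1] = 1.0      # e.g., reward any transition into the last state
```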

91 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Elements of the MDP view  Policies  Return: e.g., discounted sum of future rewards  Value functions  Optimal value functions  Optimal policies  Greedy policies  Models: probability models, sample models  Backups  Etc.

92 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Backups

93 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Stochastic Dynamic Programming: needs a probability model to compute all the required expected values

94 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 e.g., Value Iteration: a SWEEP = update the value of each state once using the max backup, with lookup-table storage of V
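
A sketch of one such sweep in Python, using P[a, s, s2] and R[a, s, s2] arrays like those in the MDP sketch above (shapes and names are assumptions):

```python
import numpy as np

def value_iteration_sweep(V, P, R, gamma=0.9):
    """Max backup applied to every state once:
    V(s) <- max_a sum_s2 P[a, s, s2] * (R[a, s, s2] + gamma * V(s2))."""
    backups = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    return backups.max(axis=0)

V = np.zeros(P.shape[1])          # one entry per state (lookup table)
for _ in range(100):              # repeated sweeps converge toward V*
    V = value_iteration_sweep(V, P, R)
```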

95 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Dynamic Programming Bellman 195? “… it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination which will possibly give it a pejorative meaning. It’s impossible. … It was something not even a Congressman could object to.” Bellman

96 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Stochastic Dynamic Programming is COMPUTATIONALLY COMPLEX: multiple exhaustive sweeps; a complex “backup” operation; complete storage of the evaluation function. It also NEEDS AN ACCURATE PROBABILITY MODEL.

97 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Approximating Stochastic DP  AVOID EXHAUSTIVE SWEEPS OF STATE SET To which states should the backup operation be applied?  SIMPLIFY THE BACKUP OPERATION Can one avoid evaluating all possible next states in each backup operation?  REDUCE DEPENDENCE ON MODELS What if details of process are unknown or hard to quantify?  COMPACTLY APPROXIMATE V Can one avoid explicitly storing all of V ?

98 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Avoiding Exhaustive Sweeps  Generate multiple sample paths: in reality or with a simulation (sample) model  FOCUS backups around sample paths  Accumulate results in V

99 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Simplifying Backups

100 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Simple Monte Carlo: no probability model needed; real or simulated experience; relatively efficient on very large problems

101 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Temporal Difference Backup: no probability model needed; real or simulated experience; incremental, but less informative than a DP backup

102 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Rewrite this to get the TD error: δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) (our familiar TD error)
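
The same error written as a tabular TD(0) update (a minimal sketch; step size and names are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Sample backup from one observed transition (s, r, s_next): no model needed."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

V = [0.0] * 5
td0_update(V, s=0, r=1.0, s_next=1)
```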

103 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Why TD? [figure: a small Markov chain example with states labeled “Bad” and “New” and outcomes Win and Loss, with 90% and 10% transition probabilities]

104 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Compactly Approximate V: function approximation methods, e.g., artificial neural networks [diagram: an ANN maps a description of state s to an evaluation of s]

105 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Q-Learning (Watkins 1989; Leigh Tesfatsion) Action values: Q*(s, a) = expected return for taking action a in state s and following an optimal policy thereafter. Let Q(s, a) be the current estimate of Q*(s, a). For any state s, any action with a maximal optimal action value is an optimal action: a* = argmax_a Q*(s, a) (an optimal action in s).

106 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Q-Learning Backup: does not need a probability model (for either learning or performance)
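
A tabular sketch of that backup (model-free; the table sizes, step size, and names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a2 Q(s_next, a2) - Q(s, a)];
    only the sampled transition is needed, never the transition probabilities."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((4, 2))               # 4 states, 2 actions
q_learning_update(Q, s=0, a=1, r=0.0, s_next=3)
```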

107 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Another View: Temporal Consistency. V_t = r_{t+1} + r_{t+2} + r_{t+3} + ⋯ and V_{t−1} = r_t + r_{t+1} + r_{t+2} + ⋯, so: V_{t−1} = r_t + V_t, or: r_t + V_t − V_{t−1} = 0 (the “TD error”)

108 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Review  MDPs  Dynamic Programming  Backups  Bellman equations (temporal consistency)  Approximating DP Avoid exhaustive sweeps Simplify backups Reduce dependence on models Compactly approximate V  A good case can be made for using RL to approximate solutions to large MDPs

109 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Plan  High-level intro to RL  Part I: The personal odyssey  Part II: The modern view  Part III: Intrinsically Motivated RL

110 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A Common View [diagram: the Agent sends an action to the Environment, which returns a state and a reward]

111 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A Less Misleading Agent View… [diagram: the RL agent receives external sensations and internal sensations, has memory and state, produces actions, and the reward signal is generated internally]

112 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Motivation  “Forces” that energize an organism to act and that direct its activity.  Extrinsic Motivation: being moved to do something because of some external reward ($$, a prize, etc.).  Intrinsic Motivation: being moved to do something because it is inherently enjoyable.

113 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Intrinsic Motivation  An activity is intrinsically motivated if the agent does it for its own sake rather than as a step toward solving a specific problem  Curiosity, Exploration, Manipulation, Play, Learning itself...  Can an artificial learning system be intrinsically motivated?  Specifically, can a Reinforcement Learning system be intrinsically motivated? Working with Satinder Singh

114 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Usual View of RL Reward looks extrinsic

115 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Less Misleading View All reward is intrinsic.

116 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 So What is IMRL?  Key distinction: Extrinsic reward = problem specific Intrinsic reward = problem independent  Learning phases: Developmental Phase: gain general competence Mature Phase: learn to solve specific problems  Why important: open-ended learning via hierarchical exploration

117 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Scaling Up: Abstraction  Ignore irrelevant details Learn and plan at a higher level Reduce search space size Hierarchical planning and control Knowledge transfer Quickly react to new situations c.f. macros, chunks, skills, behaviors,...  Temporal abstraction: ignore temporal details (as opposed to aggregating states)

118 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The “Macro” Idea  A sequence of operations with a name; can be invoked like a primitive operation Can invoke other macros... hierarchy But: an open-loop policy  Closed-loop macros A decision policy with a name; can be invoked like a primitive control action behavior (Brooks, 1986), skill (Thrun & Schwartz, 1995), mode (e.g., Grudic & Ungar, 2000), activity (Harel, 1987), temporally-extended action, option (Sutton, Precup, & Singh, 1997)

119 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Options (Precup, Sutton, & Singh, 1997) A generalization of actions to include temporally-extended courses of action Example: robot docking: π: pre-defined controller; β: terminate when docked or charger not visible; I: all states in which charger is in sight
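
A simple way to represent an option as the triple ⟨I, π, β⟩ (a sketch only; the docking controller itself is not spelled out, and all names are placeholders):

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]              # I: states where the option may start
    policy: Callable[[Any], Any]          # pi: state -> action while the option runs
    termination: Callable[[Any], float]   # beta: probability of terminating in a state

# e.g., a docking option: start whenever the charger is in sight, follow a
# pre-defined controller, stop when docked or when the charger is lost
dock = Option(
    initiation_set={"charger_visible"},
    policy=lambda s: "move_toward_charger",
    termination=lambda s: 1.0 if s in ("docked", "charger_lost") else 0.0,
)
```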

120 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Options cont.  Policies can select from a set of options & primitive actions  Generalizations of the usual concepts: Transition probabilities (“option models”) Value functions Learning and planning algorithms  Intra-option off-policy learning: Can simultaneously learn policies for many options from same experience

121 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Options define a Semi-Markov Decision Process [figure, State vs. Time, three levels: an MDP (discrete time, homogeneous discount); an SMDP (continuous time, discrete events, interval-dependent discount); and Options over an MDP (discrete time with overlaid discrete events, interval-dependent discount)]. A discrete-time SMDP overlaid on an MDP can be analyzed at either level.

122 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Where do Options come from?  Dominant approach: hand-crafted from the start  How can an agent create useful options for itself? Several different approaches (McGovern, Digney, Hengst, ….). All involve defining subgoals of various kinds.

123 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Canonical Illustration: Rooms Example [figure: a gridworld with 4 rooms and 4 hallways; 8 multi-step options (to each room’s 2 hallways, e.g., O_1, O_2); 4 unreliable primitive actions (up, down, right, left) that fail 33% of the time; goal G; goal states are given a terminal value of 1; γ = 0.9; all rewards zero]. Given the goal location, quickly plan the shortest route.

124 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Task-Independent Subgoals  “Bottlenecks”, “Hubs”, “Access States”, …  Surprising events  Novel events  Incongruous events  Etc. …

125 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 A Developmental Approach  Subgoals: events that are “intrinsically interesting”; not in the service of any specific task  Create options to achieve them  Once option is well learned, the triggering event becomes less interesting  Previously learned options are available as actions in learning new option policies  When facing a specific problem: extract a “working set” of actions (primitive and abstract) for planning and learning

126 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 For Example:  Built-in salient stimuli: changes in lights and sounds  Intrinsic reward generated by each salient event: Proportional to the error in prediction of that event according to the option model for that event (“surprise”)  Motivated in part by novelty responses of dopamine neurons
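
A sketch of the intrinsic-reward rule described above: reward proportional to the error in the responsible option model’s prediction of the salient event (the scaling constant and names are mine):

```python
def intrinsic_reward(predicted_prob, event_occurred, scale=1.0):
    """Surprise-based intrinsic reward for a salient event: proportional to the
    error in the event's option-model prediction that it would occur now."""
    prediction_error = abs(float(event_occurred) - predicted_prob)
    return scale * prediction_error

# usage: the light-on event happens although its option model gave it only 0.2
r_int = intrinsic_reward(predicted_prob=0.2, event_occurred=True)
```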

127 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Creating Options  Upon first occurrence of salient event: create an option and initialize: Initiation set Policy Termination condition Option model  All options and option models updated all the time using intra-option learning

128 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Playroom Domain Agent has eye, hand, visual marker Actions: move eye to hand move eye to marker move eye N, S, E, or W move eye to random object move hand to eye move hand to marker move marker to eye move marker to hand If both eye and hand are on object: turn on light, push ball. etc.

129 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 The Playroom Domain cont. Switch controls room lights Bell rings and moves one square if ball hits it Pressing the blue/red block turns music on/off Lights have to be on to see colors Can push blocks Monkey cries out if bell and music both sound in dark room

130 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Skills  To make the monkey cry out: move eye to switch; move hand to eye; turn lights on; move eye to blue block; move hand to eye; turn music on; move eye to switch; move hand to eye; turn lights off; move eye to bell; move marker to eye; move eye to ball; move hand to ball; kick ball to make bell ring  Using skills (options): turn lights on; turn music on; turn lights off; ring bell

131 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Reward for Salient Events

132 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Speed of Learning Various Skills

133 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Learning to Make the Monkey Cry Out

134 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Connects with Previous RL Work  Schmidhuber  Thrun and Moller  Sutton  Kaplan and Oudeyer  Duff  Others…. But these did not have the option framework and related algorithms available

135 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Beware the “Fallacy of Misplaced Concreteness” Alfred North Whitehead We have a tendency to mistake our models for reality, especially when they are good models.

136 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 Thanks to all my past PhD Students  Rich Sutton  Chuck Anderson  Stephen Judd  Robbie Jacobs  Jonathan Bachrach  Vijay Gullapalli  Satinder Singh  Bob Crites  Steve Bradtke  Mike Duff  Amy McGovern  Ted Perkins  Mike Rosenstein  Balaraman Ravindran

137 Autonomous Learning Laboratory – Department of Computer Science Andrew Barto, Okinawa Computational Neuroscience Course, July 2005 And my current students  Colin Barringer  Anders Jonsson  George D. Konidaris  Ashvin Shah  Özgür Şimşek  Andrew Stout  Chris Vigorito  Pippin Wolfe And the funding agencies AFOSR, NSF, NIH, DARPA

138 Whew! Thanks!

