
1 CS 188: Artificial Intelligence Spring 2007 Lecture 23: Reinforcement Learning: IV 4/19/2007 Srini Narayanan – ICSI and UC Berkeley

2 Announcements  Othello tournament rules up.  On-line readings for this week.

3 Combining DP and MC  [Figure: backup diagrams]

4 Model-Free Learning  Big idea: why bother learning T?  Update each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward value of whatever successor occurs  [Diagram: backup from s through (s, a) to successor s′]
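The update described here is the standard temporal-difference rule for evaluating a fixed policy π, written in the course's usual notation with learning rate α and discount γ:

    V^\pi(s) \;\leftarrow\; (1-\alpha)\,V^\pi(s) \;+\; \alpha\,\bigl[\,R(s,\pi(s),s') + \gamma\,V^\pi(s')\,\bigr]

Equivalently, V^\pi(s) \leftarrow V^\pi(s) + \alpha\,[\,r + \gamma\,V^\pi(s') - V^\pi(s)\,]: move the old value a fraction α of the way toward the observed sample.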

5 Example: Passive TD  Take γ = 1, α = 0.5
Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
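A small sketch of how these episodes drive the updates, assuming values start at 0 and each listed reward is received on the transition out of the listed state (the state names and ordering come straight from the traces above; the reward-timing convention is an assumption):

    # Passive TD(0) over the two recorded episodes, with gamma = 1 and alpha = 0.5.
    gamma, alpha = 1.0, 0.5
    V = {}  # state values, defaulting to 0

    episode1 = [((1, 1), -1, (1, 2)), ((1, 2), -1, (1, 3)), ((1, 3), -1, (2, 3)),
                ((2, 3), -1, (3, 3)), ((3, 3), -1, (3, 2)), ((3, 2), -1, (3, 3)),
                ((3, 3), -1, (4, 3)), ((4, 3), +100, None)]   # exit: no successor
    episode2 = [((1, 1), -1, (1, 2)), ((1, 2), -1, (1, 3)), ((1, 3), -1, (2, 3)),
                ((2, 3), -1, (3, 3)), ((3, 3), -1, (3, 2)), ((3, 2), -1, (4, 2)),
                ((4, 2), -100, None)]

    for episode in (episode1, episode2):
        for s, r, s_next in episode:
            target = r + gamma * V.get(s_next, 0.0) if s_next is not None else r
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * target

    print(V)  # e.g. V[(4,3)] = 50.0 after the first episode's exit update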

6 TD Learning features  On-line, incremental  Bootstrapping (like DP, unlike MC)  Model-free  Converges, for any fixed policy, to the correct value of each state under that policy  on average when α is small and held constant  with probability 1 when α starts high and is decreased over time (e.g., α_k = 1/k)
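Formally, the with-probability-1 guarantee rests on the usual stochastic-approximation conditions on the step sizes, which a schedule like α_k = 1/k satisfies:

    \sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty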

7 Driving Home  [Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

8 Problems with TD Value Learning  TD value learning is model-free for policy evaluation  However, if we want to turn our value estimates into a policy, we're sunk: acting greedily with respect to V requires the transition model T and rewards R  Idea: learn values of state-action pairs (Q-values) directly  Makes action selection model-free too!
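Concretely, extracting a policy from state values needs a one-step lookahead through the model, which a model-free learner does not have:

    \pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\,V(s')\,\bigr]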

9 Q-Functions  A q-value is the value of a (state, action) pair under a policy  Utility of starting in state s, taking action a, then following π thereafter
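In the course's MDP notation this can be written recursively, tying Q-values back to the state values used so far:

    Q^\pi(s,a) = \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\,V^\pi(s')\,\bigr], \qquad V^\pi(s) = Q^\pi(s,\pi(s))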

10 The Bellman Equations  Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy  Formally:
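The standard statement of the Bellman optimality equations, in the same T/R notation:

    V^*(s) = \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\,V^*(s')\,\bigr]

    Q^*(s,a) = \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\,\max_{a'} Q^*(s',a')\,\bigr]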

11 Q-Learning  Learn Q*(s,a) values  Receive a sample (s,a,s’,r)  Consider your old estimate:  Consider your new sample estimate:  Nudge the old estimate towards the new sample:
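Written out, the three quantities are the usual Q-learning ingredients:

    old estimate:    Q(s,a)
    sample estimate: r + \gamma\,\max_{a'} Q(s',a')
    update:          Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl[\,r + \gamma\,\max_{a'} Q(s',a')\,\bigr]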

12 Recap: Q-Learning  Learn Q*(s,a) values from samples  Receive a sample (s,a,s’,r)  On one hand: old estimate of return:  But now we have a new estimate for this sample:  Nudge the old estimate towards the new sample:  Equivalently, average samples over time:
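A minimal tabular sketch of this update (the function name, the actions helper, and the environment interface are illustrative assumptions, not part of the lecture):

    from collections import defaultdict

    def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, terminal=False):
        """One tabular Q-learning step: nudge Q(s,a) toward the new sample."""
        # Sample estimate: r + gamma * max_a' Q(s', a'); just r if s' is terminal.
        sample = r if terminal else r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
        # Running exponential average of the samples seen so far.
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

    # Example: unseen (state, action) pairs default to 0.
    Q = defaultdict(float)
    q_update(Q, s=(1, 1), a='up', r=-1, s_next=(1, 2),
             actions=lambda state: ['up', 'down', 'left', 'right'])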

13 Q-Learning

14 Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1−ε, act according to the current policy (e.g., the action with the best q-value)  Problems with random actions?  You do explore the space, but keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions
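A minimal sketch of ε-greedy selection (names are illustrative; Q is assumed to be a dict keyed by (state, action), e.g. a defaultdict of zeros):

    import random

    def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
        """With probability epsilon act randomly, otherwise take the best-looking action."""
        if random.random() < epsilon:
            return random.choice(legal_actions)              # explore
        return max(legal_actions, key=lambda a: Q[(s, a)])   # exploit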

15 Q-Learning  Q-learning produces tables of q-values:

16 Demo of Q-Learning  Demo: arm control  Parameters  learning rate  discount factor (high values weight future rewards more heavily)  exploration rate (should decrease with time)  MDP  Reward = number of pixels moved to the right / iteration number  Actions: arm up and down (yellow line), hand up and down (red line)

17 Q-Learning Properties  Will converge to the optimal policy  If you explore enough  If you make the learning rate small enough  Neat property: does not learn policies which are optimal in the presence of action-selection noise

18 Exploration Functions  When to explore  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function  Takes a value estimate and a count, and returns an optimistic utility, e.g. (exact form not important)
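One common form (the exact form is not important, as noted) adds a bonus that shrinks with the visit count n of a state or state-action pair, for some constant k:

    f(u, n) = u + \frac{k}{n}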

19 Q-Learning  In realistic situations, we cannot possibly learn about every single state!  Too many states to visit them all in training  Too many states to even hold the q-tables in memory  Instead, we want to generalize:  Learn about some small number of training states from experience  Generalize that experience to new, similar states  This is a fundamental idea in machine learning  Clustering, classification (unsupervised, supervised), non-parametric learning

20 Evaluation Functions  Function which scores non-terminals  Ideal function: returns the utility of the position  In practice: typically a weighted linear sum of features:  e.g. f_1(s) = (num white queens − num black queens), etc.
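The weighted linear sum has the familiar form, with one weight per feature:

    \text{Eval}(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s)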

21 Function Approximation  Problem: inefficient to learn each state’s utility (or eval function) one by one  Solution: what we learn about one state (or position) should generalize to similar states  Very much like supervised learning  If states are treated entirely independently, we can only learn on very small state spaces

22 Linear Value Functions  Another option: values are linear functions of features of states (or action-state pairs)  Good if you can describe states well using a few features (e.g. for game playing board evaluations)  Now we only have to learn a few weights rather than a value for each state  [Figure: gridworld with state values]

23 TD Updates for Linear Qs  Can use TD learning with linear Qs  (Actually it's just like the perceptron!)  Old Q-learning update:  Simply update the weights of the features in Q(s,a)
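With Q(s,a) = \sum_i w_i f_i(s,a), the standard approximate Q-learning rule adjusts each weight in proportion to its feature value and the same error term as the tabular update, consistent with the perceptron analogy above:

    \text{difference} = \bigl[\,r + \gamma\,\max_{a'} Q(s',a')\,\bigr] - Q(s,a)

    w_i \leftarrow w_i + \alpha\,[\text{difference}]\; f_i(s,a)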

24 Generalization of Q-functions  Non-linear Q-functions are required for more complex spaces. Such functions can be learned using  Multi-Layer Perceptrons (TD-Gammon)  Support Vector Machines  Non-Parametric Methods

25 Demo: Learning walking controllers  (From Stanford AI Lab)

26 Policy Search

27  Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best  E.g. the value functions from the gridworld were probably bad estimates of future rewards, but they could still produce good decisions  Solution: learn the policy that maximizes rewards rather than the value that predicts rewards  This is the idea behind policy search, such as what controlled the upside-down helicopter

28 Policy Search  Simplest policy search:  Start with an initial linear value function or q-function  Nudge each feature weight up and down and see if your policy is better than before  Problems:  How do we tell the policy got better?  Need to run many sample episodes!  If there are a lot of features, this can be impractical
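A minimal sketch of this nudge-and-evaluate loop (all names are illustrative; evaluate_policy is assumed to run many sample episodes and return the average reward, which is exactly the expensive step noted above):

    import random

    def hill_climb_policy_search(weights, evaluate_policy, step=0.1, iters=100):
        """Greedily nudge feature weights, keeping only changes that improve returns."""
        best_score = evaluate_policy(weights)          # requires many sample episodes
        for _ in range(iters):
            i = random.randrange(len(weights))         # pick one feature weight
            for delta in (+step, -step):               # try nudging it up and down
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate_policy(candidate)
                if score > best_score:
                    weights, best_score = candidate, score
        return weights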

29 Policy Search*  Advanced policy search:  Write a stochastic (soft) policy:  Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don’t have to know them)  Take uphill steps, recalculate derivatives, etc.
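One standard way to write such a soft policy is a softmax over a linear score of state-action features (one common parameterization; others work too):

    \pi_w(a \mid s) = \frac{\exp\bigl(w \cdot f(s,a)\bigr)}{\sum_{a'} \exp\bigl(w \cdot f(s,a')\bigr)}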

30 Helicopter Control (Andrew Ng)

31 Neural Correlates of RL  Parkinson's Disease  Motor control + initiation?  Intracranial self-stimulation; drug addiction; natural rewards  Reward pathway?  Learning?  Also involved in:  Working memory  Novel situations  ADHD  Schizophrenia  …

32 Conditioning  Ivan Pavlov  [Figure: conditional stimulus, unconditional stimulus, unconditional response (reflex), conditional response (reflex)]

33 Dopamine Levels track RL signals  [Figure: unpredicted reward (unlearned/no stimulus); predicted reward (learned task); omitted reward (probe trial)]  (Montague et al. 1996)

34 Current Hypothesis Phasic dopamine encodes a reward prediction error  Precise (normative!) theory for generation of DA firing patterns  Compelling account for the role of DA in classical conditioning: prediction error acts as signal driving learning in prediction areas  Evidence  Monkey single cell recordings  Human fMRI studies  Current Research  Better information processing model  Other reward/punishment circuits including Amygdala (for visual perception)  Overall circuit (PFC-Basal Ganglia interaction)

35 Reinforcement Learning  What you should know  MDPs  Utilities, discounting  Policy Evaluation  Bellman’s equation  Value iteration  Policy iteration  Reinforcement Learning  Adaptive Dynamic Programming  TD learning (Model-free)  Q Learning  Function Approximation

36 Hierarchical Learning

37 Hierarchical RL  Stratagus: Example of a large RL task, from Bhaskara Marthi's thesis (w/ Stuart Russell)  Stratagus is hard for reinforcement learning algorithms  > 10^100 states  > 10^30 actions at each point  Time horizon ≈ 10^4 steps  Stratagus is hard for human programmers  Typically takes several person-months for game companies to write computer opponent  Still, no match for experienced human players  Programming involves much trial and error  Hierarchical RL  Humans supply high-level prior knowledge using partial program  Learning algorithm fills in the details

38 Partial "Alisp" Program

(defun top ()
  (loop
    (choose (gather-wood) (gather-gold))))

(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))

39 Hierarchical RL  They then define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point  Very good at balancing resources and directing rewards to the right region  Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect) [DEMO]

