CS 5751 Machine Learning, Chapter 13: Reinforcement Learning (presentation transcript)

Slide 1: Reinforcement Learning
Control learning
Control policies that choose optimal actions
Q learning
Convergence

Slide 2: Control Learning
Consider learning to choose actions, e.g.:
Robot learning to dock on a battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon
Note several problem characteristics:
Delayed reward
Opportunity for active exploration
Possibility that the state is only partially observable
Possible need to learn multiple tasks with the same sensors/effectors

Slide 3: One Example: TD-Gammon (Tesauro, 1995)
Learn to play Backgammon
Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to the best human player

Slide 4: Reinforcement Learning Problem
[Figure: agent-environment interaction loop. The agent observes a state, chooses an action, and receives a reward from the environment, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$
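As a small illustration of this objective, here is a minimal Python sketch of the discounted return; the function name and the finite reward list are illustrative additions, not part of the original slides.

def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a reward of 100 arriving two steps in the future is worth gamma^2 * 100 now.
print(discounted_return([0, 0, 100], gamma=0.9))   # approximately 81.0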

Slide 5: Markov Decision Process
Assume:
a finite set of states S
a set of actions A
At each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$, then receives immediate reward $r_t$, and the state changes to $s_{t+1}$.
Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
– i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
– the functions $\delta$ and $r$ may be nondeterministic
– the functions $\delta$ and $r$ are not necessarily known to the agent
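A deterministic MDP of this kind can be written down very compactly. The toy two-state world below is my own invention for illustration; the dict-based encoding of delta and r is one possible choice, not something given on the slides.

# States, actions, and the (deterministic) transition and reward functions as tables.
states = ["s1", "s2"]
actions = ["left", "right"]

delta = {("s1", "right"): "s2", ("s1", "left"): "s1",
         ("s2", "right"): "s2", ("s2", "left"): "s1"}   # delta(s, a) -> next state
r = {("s1", "right"): 0, ("s1", "left"): 0,
     ("s2", "right"): 100, ("s2", "left"): 0}           # r(s, a) -> immediate reward

def step(s, a):
    """One interaction: take action a in state s and observe (reward, next state)."""
    return r[(s, a)], delta[(s, a)]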

Slide 6: Agent's Learning Task
Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
from any starting state in S; here $0 \le \gamma < 1$ is the discount factor for future rewards.
Note something new:
the target function is $\pi : S \to A$
but we have no training examples of the form $\langle s, a \rangle$
training examples are of the form $\langle \langle s, a \rangle, r \rangle$

Slide 7: Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states:
$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state s.
Restated, the task is to learn the optimal policy $\pi^*$:
$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \; (\forall s)$
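Under the deterministic assumption, $V^{\pi}(s)$ can be estimated simply by rolling the policy forward and summing discounted rewards. A minimal sketch follows; the finite horizon is a truncation added for illustration (the slide's sum is infinite), and the step function is assumed to behave like the toy one sketched above.

def policy_value(s, policy, step, gamma=0.9, horizon=100):
    """Estimate V^pi(s) by following policy (a dict s -> a) in a deterministic world."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]                 # the policy picks the action
        reward, s = step(s, a)        # deterministic transition and reward
        total += discount * reward
        discount *= gamma             # later rewards are discounted by gamma^i
    return total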

Slide 8
[Figure: a simple grid world with absorbing goal state G, drawn four ways: the r(s,a) immediate reward values (100 for actions entering G, 0 everywhere else), the Q(s,a) values (72, 81, 90, 100), the V*(s) values (81, 90, 100), and one optimal policy.]

Slide 9: What to Learn
We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
We could then do a lookahead search to choose the best action from any state s because
$\pi^*(s) = \arg\max_{a} \big[ r(s,a) + \gamma V^*(\delta(s,a)) \big]$
A problem: this works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \Re$.
But when it doesn't, we can't choose actions this way.
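The one-step lookahead described here is easy to state in code; the sketch below assumes delta, r, and V* are all available as dicts, which is exactly the knowledge requirement the slide is pointing out.

def lookahead_action(s, actions, delta, r, v_star, gamma=0.9):
    """Pick the action maximizing r(s, a) + gamma * V*(delta(s, a))."""
    return max(actions, key=lambda a: r[(s, a)] + gamma * v_star[delta[(s, a)]])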

Slide 10: Q Function
Define a new function very similar to $V^*$:
$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$
If the agent learns Q, it can choose the optimal action even without knowing $\delta$:
$\pi^*(s) = \arg\max_{a} Q(s,a)$
Q is the evaluation function the agent will learn.
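By contrast, acting greedily with respect to a learned Q table needs no model at all. A minimal sketch, assuming Q is stored as a dict keyed by (state, action):

def greedy_action(s, actions, q):
    """Pick the action with the largest learned Q(s, a); delta and r are not consulted."""
    return max(actions, key=lambda a: q[(s, a)])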

Slide 11: Training Rule to Learn Q
Note that Q and $V^*$ are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
which allows us to write Q recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Let $\hat{Q}$ denote the learner's current approximation to Q. Consider the training rule
$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
where s' is the state resulting from applying action a in state s.

Slide 12: Q Learning for Deterministic Worlds
For each s, a initialize the table entry $\hat{Q}(s,a) \leftarrow 0$
Observe the current state s
Do forever:
Select an action a and execute it
Receive immediate reward r
Observe the new state s'
Update the table entry for $\hat{Q}(s,a)$ as follows:
$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
$s \leftarrow s'$
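The loop on this slide translates almost line for line into Python. The sketch below is illustrative, not the course's code: random action selection and the episode/step limits are additions of mine (the slide simply says "do forever" and leaves action selection open), and step(s, a) is assumed to return (reward, next_state) as in the toy world above.

import random
from collections import defaultdict

def q_learning(actions, step, start, gamma=0.9, episodes=1000, max_steps=100):
    """Tabular Q learning for a deterministic world, following the slide's loop."""
    q = defaultdict(float)                       # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = start                                # observe current state s
        for _ in range(max_steps):
            a = random.choice(actions)           # select an action a and execute it
            reward, s_next = step(s, a)          # receive r, observe the new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = reward + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next                           # s <- s'
    return q

Run against the toy two-state world sketched on slide 5, this loop fills in a Q table that should approach the true Q values of that little world, as the convergence slides below argue.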

Slide 13: Updating $\hat{Q}$
[Figure: two snapshots of the grid world. In the initial state $s_1$, $\hat{Q}(s_1, a_{right}) = 72$; the next state $s_2$ has outgoing $\hat{Q}$ values 63, 81, and 100.]
Taking action $a_{right}$ from $s_1$ gives
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$
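The single update can be checked directly; this two-line computation just reproduces the slide's arithmetic and is not itself part of the slides.

gamma, reward = 0.9, 0
print(reward + gamma * max([63, 81, 100]))   # 90.0, matching the updated table entry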

Slide 14: Convergence
$\hat{Q}$ converges to Q. Consider the case of a deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
Proof: define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval, the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is,
$\Delta_n = \max_{s,a} |\hat{Q}_n(s,a) - Q(s,a)|$

Slide 15: Convergence (cont.)
For any table entry $\hat{Q}_n(s,a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is bounded as shown below.
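The bound that appeared on the original slide is the standard argument from Mitchell's Chapter 13; reconstructed in LaTeX, the key step is the fact that $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$:

\begin{align*}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \Big|\big(r + \gamma \max_{a'} \hat{Q}_n(s',a')\big) - \big(r + \gamma \max_{a'} Q(s',a')\big)\Big| \\
  &= \gamma \,\Big|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\Big| \\
  &\le \gamma \,\max_{a'} \big|\hat{Q}_n(s',a') - Q(s',a')\big| \\
  &\le \gamma \,\max_{s'',a'} \big|\hat{Q}_n(s'',a') - Q(s'',a')\big| \;=\; \gamma\, \Delta_n .
\end{align*}

Since every pair is visited during a full interval, the table's maximum error shrinks by at least a factor of $\gamma$ per full interval, which gives convergence because $\gamma < 1$.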

Slide 16: Nondeterministic Case
What if reward and next state are nondeterministic?
We redefine V and Q by taking expected values:
$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\Big[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\Big]$
$Q(s,a) \equiv E[r(s,a) + \gamma V^*(\delta(s,a))]$

Slide 17: Nondeterministic Case (cont.)
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n \big[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \big]$
where
$\alpha_n = \dfrac{1}{1 + \mathit{visits}_n(s,a)}$
Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992].
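A minimal sketch of this weighted update, assuming the Q table and the visit counts are kept in defaultdicts (a representational choice of mine, not given on the slides):

from collections import defaultdict

def nondeterministic_q_update(q, visits, s, a, reward, s_next, actions, gamma=0.9):
    """One update of the nondeterministic rule; q: defaultdict(float), visits: defaultdict(int)."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])                        # alpha_n = 1 / (1 + visits_n(s, a))
    target = reward + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target      # move the estimate toward the target

# Example wiring:
# q, visits = defaultdict(float), defaultdict(int)
# nondeterministic_q_update(q, visits, "s1", "right", reward=0, s_next="s2", actions=["left", "right"])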

Slide 18: Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$
Why not two steps?
$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$
Or n?
$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$
Blend all of these:
$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\big[ Q^{(1)}(s_t,a_t) + \lambda\, Q^{(2)}(s_t,a_t) + \lambda^2 Q^{(3)}(s_t,a_t) + \cdots \big]$
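The n-step estimates and their lambda-weighted blend can be computed from a single stretch of experience. The sketch below truncates the blend at the available estimates, a simplification of mine; rewards[k] stands for $r_{t+k}$ and max_q_next[k] for $\max_a \hat{Q}(s_{t+k+1}, a)$.

def n_step_estimates(rewards, max_q_next, gamma=0.9):
    """Q^(n) for n = 1..N: discounted rewards for n steps, then bootstrap from Q-hat."""
    out = []
    for n in range(1, len(rewards) + 1):
        g = sum((gamma ** k) * rewards[k] for k in range(n))
        out.append(g + (gamma ** n) * max_q_next[n - 1])
    return out

def lambda_blend(estimates, lam=0.7):
    """Truncated Q^lambda = (1 - lambda) * sum_n lambda^(n-1) * Q^(n)."""
    return (1 - lam) * sum((lam ** i) * q_n for i, q_n in enumerate(estimates))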

Slide 19: Temporal Difference Learning (cont.)
Equivalent recursive expression:
$Q^{\lambda}(s_t, a_t) = r_t + \gamma \big[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \big]$
The TD($\lambda$) algorithm uses the above training rule.
Sometimes converges faster than Q learning.
Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992).
Tesauro's TD-Gammon uses this algorithm.

Slide 20: Subtleties and Ongoing Research
Replace the $\hat{Q}$ table with a neural network or other generalizer
Handle the case where the state is only partially observable
Design optimal exploration strategies
Extend to continuous actions and states
Learn and use $\hat{\delta} : S \times A \to S$, an approximation to $\delta$
Relationship to dynamic programming

Slide 21: RL Summary
Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states and actions is unknown
Temporal difference learning
– learns by reducing discrepancies between successive estimates
– used in TD-Gammon
V(s) – state value function
– needs known reward and state-transition functions

Slide 22: RL Summary (cont.)
Q(s,a) – state/action value function
– related to V
– does not need the reward or state-transition functions
– training rule: $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$
– related to dynamic programming
– measures the actual reward received for the action plus the future value estimated by the current Q function
– deterministic case: replace the existing estimate
– nondeterministic case: move the table estimate toward the measured estimate
– convergence can be shown in both cases

