Reinforcement Learning

Presentation transcript:

1 Reinforcement Learning
Michael L. Littman. Slides appear to have been adapted from the "Machine Learning RL" slides by Elena Marchiori.

2 Reinforcement Learning
[Read Ch. 13] [Exercise 13.2]
- Control learning
- Control policies that choose optimal actions
- Q learning
- Convergence

3 Control Learning
Consider learning to choose actions, e.g.:
- Robot learning to dock on a battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon
Note several problem characteristics:
- Delayed reward
- Opportunity for active exploration
- Possibility that the state is only partially observable
- Possible need to learn multiple tasks with the same sensors/effectors

4 One Example: TD-Gammon
[Tesauro, 1995] Learn to play Backgammon.
Immediate reward:
- +100 if win
- -100 if lose
- 0 for all other states
Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

5 Reinforcement Learning Problem
Goal: learn to choose actions that maximize
r_0 + γ r_1 + γ² r_2 + … , where 0 ≤ γ < 1
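
As a concrete illustration of this objective (not part of the original slides), here is a minimal Python sketch that computes the discounted return of a finite reward sequence; the reward values and γ = 0.9 are made up for the example.

    # Discounted return: r0 + gamma*r1 + gamma^2*r2 + ...
    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # Example with an invented reward sequence: 0 + 0 + 0.81*100 ≈ 81.0
    print(discounted_return([0, 0, 100], gamma=0.9))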

6 Markov Decision Processes
Assume:
- a finite set of states S and a set of actions A
- at each discrete time step, the agent observes state s_t ∈ S and chooses action a_t ∈ A
- it then receives immediate reward r_t, and the state changes to s_{t+1}
Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
- i.e., r_t and s_{t+1} depend only on the current state and action
- the functions δ and r may be nondeterministic
- the functions δ and r are not necessarily known to the agent
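
To make the notation concrete, here is a small hypothetical deterministic MDP sketched in Python: δ and r are encoded as dictionaries keyed by (state, action). The three-state chain below is invented purely for illustration and is reused by the later sketches.

    # A made-up deterministic MDP: three states in a chain, two actions.
    S = ["s0", "s1", "s2"]
    A = ["left", "right"]

    # Transition function delta(s, a) -> next state
    delta = {
        ("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "right"): "s2",
        ("s0", "left"):  "s0", ("s1", "left"):  "s0", ("s2", "left"):  "s1",
    }

    # Reward function r(s, a): stepping from s1 to s2 pays 100, everything else 0.
    r = {(s, a): 0.0 for s in S for a in A}
    r[("s1", "right")] = 100.0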

7 Agent’s Learning Task
Execute actions in the environment, observe results, and learn an action policy π : S → A that maximizes
E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
from any starting state in S; here 0 ≤ γ < 1 is the discount factor for future rewards (it sometimes makes sense to use γ = 1).
Note something new:
- The target function is π : S → A
- But we have no training examples of the form ⟨s, a⟩
- Training examples are of the form ⟨⟨s, a⟩, r⟩

8 Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0..∞} γ^i r_{t+i}
where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
π* ≡ argmax_π V^π(s), for all s
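
To make V^π concrete: in a deterministic world a fixed policy can be evaluated by simply rolling it out from s and summing discounted rewards. The sketch below is illustrative only; it truncates the infinite sum at a finite horizon and assumes the dictionary-based δ and r from the earlier example.

    def evaluate_policy(s, pi, delta, r, gamma=0.9, horizon=100):
        """V^pi(s): discounted return from following policy pi starting at s (truncated)."""
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = pi(s)
            total += discount * r[(s, a)]
            discount *= gamma
            s = delta[(s, a)]
        return total

    # With the chain MDP above and the policy "always go right":
    # evaluate_policy("s0", lambda s: "right", delta, r)   # ≈ 0.9 * 100 = 90.0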

9 [Figure-only slide; content not captured in the transcript]

10 What to Learn
We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
It could then do a look-ahead search to choose the best action from any state s, because
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
A problem: this can work if the agent knows δ : S × A → S and r : S × A → ℝ.
But when it doesn't, it can't choose actions this way.
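
The look-ahead step described here can be written in a few lines, assuming the agent does know δ and r (taken to be dictionaries as in the earlier sketch) and has some estimate V of V*; the function name and argument layout are illustrative, not from the slides.

    def greedy_action_with_model(s, actions, delta, r, V, gamma=0.9):
        """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ] -- requires knowing delta and r."""
        return max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])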

11 Q Function
Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
π*(s) = argmax_a Q(s, a)
Q is the evaluation function the agent will learn [Watkins, 1989].
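
By contrast, once the agent has (an estimate of) Q, picking an action needs no model at all, only a lookup in its own table. A minimal sketch, assuming Q is stored as a dictionary keyed by (state, action):

    def greedy_action_from_q(s, actions, Q):
        """pi*(s) = argmax_a Q(s, a) -- no delta or r required."""
        return max(actions, key=lambda a: Q[(s, a)])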

12 Training Rule to Learn Q
Note Q and V* are closely related:
V*(s) = max_{a'} Q(s, a')
This allows us to write Q recursively as
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.

13 Q Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
- Select an action a and execute it
- Receive immediate reward r
- Observe the new state s'
- Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- s ← s'
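
A minimal Python sketch of this loop, run for a fixed number of steps rather than forever; the slide leaves the action-selection rule open, so purely random exploration is used here, and the chain MDP (S, A, delta, r) from the earlier sketch is assumed.

    import random

    def q_learning_deterministic(S, A, delta, r, gamma=0.9, steps=10_000, seed=0):
        """Tabular Q-learning for a deterministic world, using the update rule from the slide."""
        rng = random.Random(seed)
        Q = {(s, a): 0.0 for s in S for a in A}    # initialize every table entry to 0
        s = rng.choice(S)                          # observe some current state
        for _ in range(steps):
            a = rng.choice(A)                      # selection strategy unspecified on the slide; explore randomly
            reward, s_next = r[(s, a)], delta[(s, a)]
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in A)
            s = s_next
        return Q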

14 Updating Q
Notice that if rewards are non-negative, then
(∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
(∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

15 Convergence Theorem
Q̂ converges to Q. Consider the case of a deterministic world, with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval, the largest error in the table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|
For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
= γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')| = γ Δ_n

16 Note we used the general fact that:
|max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
This works with things other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
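
A quick numerical sanity check of the non-expansion fact, using two arbitrary made-up functions over a small action set:

    f1 = {"a1": 3.0, "a2": 7.0, "a3": 1.0}   # arbitrary values, purely for illustration
    f2 = {"a1": 5.0, "a2": 6.5, "a3": 4.0}

    lhs = abs(max(f1.values()) - max(f2.values()))     # |max f1 - max f2| = 0.5
    rhs = max(abs(f1[a] - f2[a]) for a in f1)          # max |f1 - f2| = 3.0
    assert lhs <= rhs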

17 Non-deterministic Case (1)
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values:
V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …] = E[ Σ_{i=0..∞} γ^i r_{t+i} ]
Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

18 Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
where
α_n = 1 / (1 + visits_n(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. Standard properties: Σ_n α_n = ∞, Σ_n α_n² < ∞.
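
A sketch of this altered update step in Python, using the visit-count learning rate α_n = 1/(1 + visits_n(s, a)) given above; only the update itself is shown, and how actions are selected and how the stochastic environment is sampled is left outside the function.

    from collections import defaultdict

    Q = defaultdict(float)        # Q-hat table, every entry starts at 0
    visits = defaultdict(int)     # visit counts per (state, action) pair

    def q_update_nondeterministic(s, a, reward, s_next, actions, gamma=0.9):
        """One Q-hat update with decaying learning rate alpha_n = 1 / (1 + visits_n(s, a))."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target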

19 Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive Q estimates.
One-step time difference:
Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps?
Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n?
Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]
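
To make the blending concrete, here is an illustrative sketch that computes the n-step returns Q^(n) from a recorded trajectory and mixes them with weights (1 − λ)λ^(n−1). The trajectory format (parallel lists of rewards and states, with one more state than reward) and the truncation of the infinite sum at the end of the trajectory are assumptions made for the example.

    def n_step_return(rewards, states, Q, actions, n, t=0, gamma=0.9):
        """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped max-Q value at state s_{t+n}."""
        ret = sum((gamma ** i) * rewards[t + i] for i in range(n))
        ret += (gamma ** n) * max(Q[(states[t + n], a)] for a in actions)
        return ret

    def lambda_return(rewards, states, Q, actions, lam=0.5, gamma=0.9, t=0):
        """Q^lambda: (1 - lambda) * sum over n of lambda^(n-1) * Q^(n), truncated at the trajectory's end."""
        N = len(rewards) - t          # longest lookahead available from time t
        return (1 - lam) * sum(
            (lam ** (n - 1)) * n_step_return(rewards, states, Q, actions, n, t, gamma)
            for n in range(1, N + 1)
        )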

20 Temporal Difference Learning (2)
Equivalent expression:
Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
TD(λ) algorithm uses the above training rule.
- Sometimes converges faster than Q learning
- Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
- Tesauro's TD-Gammon uses this algorithm

21 Subtleties and Ongoing Research
- Replace the Q̂ table with a neural net or other generalizer
- Handle the case where the state is only partially observable (next?)
- Design optimal exploration strategies
- Extend to continuous actions and states
- Learn and use δ̂ : S × A → S, a model of the environment
- Relationship to dynamic programming (next?)

