Reinforcement Learning


1 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

2 Introduction
Supervised Learning: Example → Class
Reinforcement Learning: Situation → Reward, Situation → Reward, …

3 Examples
Playing chess: reward comes at the end of the game
Ping-pong: reward on each point scored
Animals: hunger and pain are negative rewards; food intake is a positive reward

4 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

5 Passive Learning
We assume the policy π is fixed: in state s we always execute action π(s).
Rewards are given.

6

7 Typical Trials
(1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (1,2) → (1,3) → … → (4,3) +1
Goal: use the observed rewards to learn the expected utility U^π(s)

8 Expected Utility
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]
The expected (discounted) sum of rewards obtained when policy π is followed starting from state s.

9 Example
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
Total reward (with γ = 1): (-0.04 × 5) + 1 = 0.80
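As a quick check of this arithmetic, a small Python snippet; the reward sequence and γ = 1 are taken from the example above:

```python
# Discounted return for the trial on this slide: five -0.04 steps, then +1.
def discounted_return(rewards, gamma=1.0):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(round(discounted_return([-0.04] * 5 + [1.0]), 2))  # 0.8
```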

10 Direct Utility Estimation
Convert the problem into a supervised learning problem:
(1,1) → U = 0.72
(2,1) → U = 0.68
Learn to map states to utilities.
But utilities are not independent of each other!
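A minimal sketch of direct utility estimation, under the assumption that each trial is recorded as a list of (state, reward) pairs (this data format is an illustration, not something specified on the slide):

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go over every visit to each state."""
    samples = defaultdict(list)
    for trial in trials:                        # trial = [(state, reward), ...]
        for i, (state, _) in enumerate(trial):
            # Reward-to-go from this visit to the end of the trial.
            g = sum((gamma ** k) * r for k, (_, r) in enumerate(trial[i:]))
            samples[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in samples.items()}
```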

11 Bellman Equations
The utility values obey the following equations:
U^π(s) = R(s) + γ Σ_{s'} T(s,s') U^π(s')
where T(s,s') is the probability of moving from s to s' under policy π.
These equations can be solved using dynamic programming, but this assumes knowledge of the model.
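When the model is known, the fixed-policy Bellman equations can be solved by simple iteration; a minimal sketch, assuming R is a dict of rewards and T[s] is a dict mapping successor states to probabilities under π (terminal states map to an empty dict):

```python
def evaluate_policy(states, R, T, gamma=1.0, max_iters=10_000, tol=1e-8):
    """Repeatedly apply U(s) <- R(s) + gamma * sum_s' T(s,s') * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            u_new = R[s] + gamma * sum(p * U[s2] for s2, p in T[s].items())
            delta = max(delta, abs(u_new - U[s]))
            U[s] = u_new
        if delta < tol:        # values have (numerically) converged
            break
    return U
```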

12 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

13 Temporal Difference Learning
Use the following update rule:
U^π(s) ← U^π(s) + α [ R(s) + γ U^π(s') − U^π(s) ]
where α is the learning rate.
This is the temporal-difference equation; it requires no model of the environment.
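A minimal sketch of this update as a standalone function (the parameter defaults are illustrative assumptions):

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    """One temporal-difference step: move U(s) toward r + gamma * U(s_next)."""
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```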

14 Example
U(1,3) = 0.84, U(2,3) = 0.92
If the transition (1,3) → (2,3) always occurred, we would hope to see (with γ = 1):
U(1,3) = R(1,3) + U(2,3) = -0.04 + 0.92 = 0.88
The current value 0.84 is a bit low and must be increased.
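Plugging the slide's numbers into the td_update sketch above, with an assumed learning rate α = 0.1 and γ = 1:

```python
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, s=(1, 3), r=-0.04, s_next=(2, 3), alpha=0.1, gamma=1.0)
print(U[(1, 3)])  # 0.84 + 0.1 * (0.88 - 0.84) = 0.844
```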

15

16

17 Considerations
Each update moves a value toward the equilibrium (Bellman) equation.
Each update involves only the observed successor state.
Over many trials the updates converge toward the correct utility values.

18 Other heuristics
Prioritized Sweeping: preferentially adjust those states whose most probable successors have just undergone a large adjustment in their utility estimates.
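A rough sketch of the prioritized-sweeping idea for a fixed policy, using a priority queue keyed by the size of the most recent change (all data structures and thresholds here are assumptions for illustration):

```python
import heapq

def prioritized_sweeping(U, R, T, predecessors, start, gamma=1.0,
                         theta=1e-4, max_updates=1_000):
    """Propagate utility changes backwards, largest expected changes first."""
    pq = [(-1.0, start)]                       # max-heap via negated priorities
    for _ in range(max_updates):
        if not pq:
            break
        _, s = heapq.heappop(pq)
        new_u = R[s] + gamma * sum(p * U[s2] for s2, p in T[s].items())
        change, U[s] = abs(new_u - U[s]), new_u
        if change > theta:
            # Predecessors of s are now likely to be inaccurate as well; queue
            # them with priority proportional to how strongly they lead to s.
            for s_prev in predecessors.get(s, ()):
                heapq.heappush(pq, (-change * T[s_prev].get(s, 0.0), s_prev))
    return U
```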

19 Richard Sutton
Author of the classic textbook "Reinforcement Learning: An Introduction" by Sutton and Barto, MIT Press, 1998.
Department of Computing Science, University of Alberta.

20 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

21 Active Reinforcement Learning
Now we must also decide which actions to take.
Obvious policy: choose the action with the highest estimated utility value.
Is that the right thing to do?

22 Active Reinforcement Learning
No! Acting purely greedily can leave us stuck in suboptimal solutions.
This is the exploration vs. exploitation tradeoff.
Why is this important? Because the learned model is not the same as the true environment.

23 Explore vs Exploit
Exploitation: maximize reward according to the current estimates.
Exploration: maximize long-term well-being.

24 Bandit Problem
An n-armed bandit has n levers. Which lever should we play to maximize reward?
In genetic algorithms, the selection strategy amounts to allocating coins optimally, given an appropriate set of assumptions.
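A small simulation of the tradeoff on an n-armed bandit using ε-greedy play (the payoff distributions and ε are assumptions for the example, not something specified on the slide):

```python
import random

def play_bandit(true_means, epsilon=0.1, pulls=10_000):
    """Keep a running value estimate per lever; explore with probability epsilon."""
    n = len(true_means)
    counts, estimates, total = [0] * n, [0.0] * n, 0.0
    for _ in range(pulls):
        if random.random() < epsilon:
            a = random.randrange(n)                        # explore
        else:
            a = max(range(n), key=lambda i: estimates[i])  # exploit
        reward = random.gauss(true_means[a], 1.0)
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean
        total += reward
    return total / pulls

print(play_bandit([0.1, 0.5, 0.9]))  # close to 0.9 when epsilon is small
```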

25 Solution
U⁺(s) ← R(s) + γ max_a f( u, N(a,s) )
U⁺(s): optimistic estimate of the utility of s.
u: the estimated value of taking action a in s.
N(a,s): number of times action a has been tried in state s.
f(u,n): exploration function, increasing in u (exploitation) and decreasing in n (exploration).
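One concrete choice of exploration function, following the optimistic scheme in Russell & Norvig's textbook, returns a large optimistic value R⁺ until an action has been tried at least Nₑ times; the constants below are illustrative assumptions:

```python
R_PLUS = 2.0   # optimistic upper estimate of the best obtainable reward (assumed)
N_E = 5        # try each action at least this many times before trusting u (assumed)

def exploration_f(u, n):
    """Increasing in u (favours exploitation); forces exploration while n < N_E."""
    return R_PLUS if n < N_E else u
```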

26 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

27 Applications: Game Playing
Checkers-playing program by Arthur Samuel (IBM).
Update rule: change the weights according to the difference between the evaluation of the current state and the backed-up value obtained by generating a full look-ahead tree.
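The weight-update idea can be sketched as a TD-style adjustment of a linear evaluation function; the feature representation and step size below are illustrative assumptions, not Samuel's actual program:

```python
def update_weights(w, features, current_value, backed_up_value, alpha=0.01):
    """Nudge the linear evaluation w . features toward the backed-up search value."""
    error = backed_up_value - current_value      # difference found by look-ahead
    return [wi + alpha * error * fi for wi, fi in zip(w, features)]
```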

28 Applications: Robot Control
The cart-pole balancing problem: control the position x of the cart so that the pole stays roughly upright.

29

30

31 Reinforcement Learning
Introduction
Passive Reinforcement Learning
Temporal Difference Learning
Active Reinforcement Learning
Applications
Summary

32 Summary
The goal is to learn utility values and an optimal mapping from states to actions.
Direct Utility Estimation ignores the dependencies among states; the utilities must obey the Bellman equations.
Temporal difference learning updates values to match those of successor states.
Active reinforcement learning additionally learns which actions to take, trading off exploration against exploitation.

33 Video http://www.youtube.com/watch?v=YQIMGV5vtd4

