
1 Reinforcement learning
10/05/2016
Sources:
Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004)
Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005)
Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)

2 Reinforcement learning

3 Reinforcement learning

4 Reinforcement learning

5 Reinforcement learning
Pavlov: Nomad 200 robot and Nomad 200 simulator (Sridhar Mahadevan, UMass)

6 Reinforcement learning
Learning from interactions instead of using a static training set
Reinforcement (rewards/penalties) is usually delayed
Observations about the environment (states)
Objective: maximising reward (i.e. task-specific)
[Figure: agent-environment interaction as a sequence of states s1…s9, actions a1…a9 and rewards r1…r9, with example rewards +3, -1 and +50]

7 Reinforcement learning
time: t = 0, 1, 2, …
states: st ∈ S
actions: at ∈ A
reward: rt ∈ ℝ
policy (strategy):
deterministic: π(s) = a
stochastic: π(s,a) is the probability that we choose action a when in state s
(infinite horizon: t → ∞)
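A minimal sketch of how the two kinds of policy can be represented in Python; the state and action names and the probabilities are made up for illustration:

```python
import random

states, actions = ["s1", "s2"], ["a1", "a2"]

# deterministic policy: pi(s) = a
det_pi = {"s1": "a1", "s2": "a2"}

# stochastic policy: pi(s, a) = probability of choosing action a in state s
stoch_pi = {("s1", "a1"): 0.7, ("s1", "a2"): 0.3,
            ("s2", "a1"): 0.5, ("s2", "a2"): 0.5}

def sample_action(pi, s):
    # draw an action according to the probabilities pi(s, .)
    weights = [pi[(s, a)] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(det_pi["s1"], sample_action(stoch_pi, "s1"))
```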

process: s0, a0, r1, s1, a1, r2, s2, …
model of the environment: transition probabilities and rewards
objective: find a policy which maximises the expected value of the total reward

Markov assumption → the dynamics of the system can be given by:
P(st+1 = s' | st, at, st-1, at-1, …, s0, a0) = P(st+1 = s' | st, at)

10 Markov Decision Processes (MDPs)
Stochastic transitions
[Figure: a small example MDP with two states (1 and 2) and two actions, a1 with reward r = 0 and a2 with reward r = 2]

The exploration-exploitation dilemma: the k-armed bandit
[Figure: an agent facing a k-armed bandit; each arm has produced a different sequence of rewards and a different average reward]
To maximise the reward in the long term we first have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
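A minimal sketch of this trade-off as an epsilon-greedy agent on a k-armed bandit; the arm reward distributions below are made-up numbers, not the ones in the slide's figure:

```python
import random

true_means = [1.0, 2.5, 0.5]            # hypothetical expected reward of each arm
eps, n_steps = 0.1, 1000
counts = [0] * len(true_means)
estimates = [0.0] * len(true_means)     # running average reward per arm

for _ in range(n_steps):
    if random.random() < eps:                       # explore: pick a random arm
        arm = random.randrange(len(true_means))
    else:                                           # exploit: pick the best arm so far
        arm = max(range(len(true_means)), key=lambda a: estimates[a])
    reward = random.gauss(true_means[arm], 1.0)     # sample a noisy reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean

print(estimates)   # approaches the true means while mostly pulling the best arm
```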

12 Discounting
With an infinite horizon the total reward (the sum of the rt) can be infinite! Solution: discounting. Instead of rt we use γ^t·rt, with γ < 1:
the discounted sum is always finite
it provides simple recursive formulas
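As a concrete illustration, a minimal sketch computing a discounted return with the recursion Rt = rt+1 + γ·Rt+1; the reward sequence is just an example:

```python
# Discounted return computed backwards with R_t = r_{t+1} + gamma * R_{t+1}.
def discounted_return(rewards, gamma=0.9):
    R = 0.0
    for r in reversed(rewards):   # walking backwards applies the recursion once per step
        R = r + gamma * R
    return R

print(discounted_return([3, -1, 50]))  # 3 + 0.9*(-1) + 0.81*50 = 42.6
```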

13 Markov Decision Process
the environment changes according to P and R
the agent takes an action: at = π(st)
we are looking for the optimal policy π* which maximises the expected discounted total reward

14 Long-term reward
The policy π of the agent is fixed
Rt is the total discounted reward (return) after step t: Rt = rt+1 + γ·rt+2 + γ²·rt+3 + …
[Figure: the interaction trajectory with example rewards +3, -1 and +50]

15 Value = expected total reward
The expected value of Rt depends on π
Vπ(s) = Eπ[Rt | st = s] is the value function
Task: find the optimal policy π* which maximises Rt in each state

16 We optimise for the long-term reward
We optimise (search for π*) for the long-term reward instead of immediate (greedy) rewards
[Figure: trajectory st, at, rt+1, st+1, at+1, rt+2, st+2, at+2, rt+3, st+3]

17 Bellman equation
Based on the Markov assumption, a recursive formula can be derived for the expected return:
Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [ R(s,a,s') + γ·Vπ(s') ]
[Figure: backup diagram from state s through the action π(s) to the successor states]
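For a fixed policy this equation is linear in Vπ, so on a small MDP it can be solved directly; a minimal sketch, where the 3-state chain and its numbers are assumptions for illustration:

```python
# Solve V = (I - gamma * P_pi)^(-1) * R_pi exactly for a tiny 3-state chain.
import numpy as np

gamma = 0.9
# P_pi[s, s']: transition probabilities under the fixed policy (assumed numbers).
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
# R_pi[s]: expected immediate reward when following pi in state s (assumed numbers).
R_pi = np.array([0.0, 0.0, 1.0])

V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V)   # value of each state under pi: [8.1, 9.0, 10.0]
```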

18 Preference relation among policies
1 ≥ 2, iff a partial ordering * is optimal if * ≥  for every policy  optimal policy exists for every problem

19 Example MDP: 4 states, 2 actions
[Figure: MDP diagram with states A, B, C, D, actions 1 and 2, an objective state, and rewards +100 and -10]
4 states, 2 actions; 10% chance to take the non-selected action

20 Two example policies
[Figure: the same MDP with states A, B, C, D, rewards -10 and +100, and the two policies]
π1(A,1) = 1, π1(A,2) = 0
π1(B,1) = 1, π1(B,2) = 0
π1(C,1) = 1, π1(C,2) = 0
π1(D,1) = 1, π1(D,2) = 0

21

22 solution for π1:
solution for π2:

23 a third policy
π3(A,1) = 0.4, π3(A,2) = 0.6
π3(B,1) = 1, π3(B,2) = 0
π3(C,1) = 0, π3(C,2) = 1
π3(D,1) = 1, π3(D,2) = 0
[Figure: the same MDP with states A, B, C, D, actions 1 and 2, rewards -10 and +100]

24

25 solution:

26 Comparison of the 3 policies
State values under the three policies:
A: 75.61, 77.78
B: π1 = 87.56, π2 = 68.05, π3 = 87.78
C:
D: 100
π1 ≤ π3 and π2 ≤ π3
π3 is optimal
there can be many optimal policies!
the optimal value function (V*) is unique

27 Optimal policies and the Bellman equation
Q(s,a) is the action-value function: Qπ(s,a) = Eπ[Rt | st = s, at = a]
The optimal policies share the same value function: V*(s) = maxa Q*(s,a)
Greedy policy: π(s) = argmaxa Q*(s,a)
The greedy policy (with respect to Q*) is optimal!
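A minimal sketch of reading off the greedy policy from a Q* table; the table values are assumed numbers:

```python
import numpy as np

Q_star = np.array([[1.0, 2.5],    # Q*(s, a) for 3 states x 2 actions (assumed numbers)
                   [0.3, 0.1],
                   [4.0, 4.0]])

greedy_policy = Q_star.argmax(axis=1)   # pi(s) = argmax_a Q*(s, a)
print(greedy_policy)                    # -> [1 0 0]
```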

28 Optimal policies and the Bellman equation
Bellman optimality equation: V*(s) = maxa Σs' P(s'|s,a) [ R(s,a,s') + γ·V*(s') ]
it is non-linear (because of the max)!
it has a unique solution
solving it solves the long-term planning problem

29 Dynamic programming for MDPs (DP)
assume P and R are known
searching for the optimal π
two algorithms: policy iteration and value iteration

30 Policy iteration
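A minimal sketch of policy iteration for a tabular MDP with known P and R; the array layout P[s, a, s'], R[s, a] and the stopping thresholds are assumptions, not taken from the slides:

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma=0.9, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # one sweep of the Bellman expectation backup for the current policy
        V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        V = policy_evaluation(policy, P, R, gamma)   # evaluate the current policy
        Q = R + gamma * np.einsum('sat,t->sa', P, V) # greedy improvement step
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # stable policy => optimal
            return policy, V
        policy = new_policy
```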

31 Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
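A minimal sketch of the formulation described above, just the constants and the state/action spaces of the car-rental MDP; the variable names are my own:

```python
# Jack's Car Rental MDP formulation (not the full DP solution).
MAX_CARS = 20            # at most 20 cars per location
MAX_MOVE = 5             # at most 5 cars moved overnight
GAMMA = 0.9
RENT_REWARD = 10         # $10 credited per rented car
MOVE_COST = 2            # $2 per car moved
LAMBDA_REQUEST = (3, 4)  # Poisson means for rental requests at locations 1 and 2
LAMBDA_RETURN = (3, 2)   # Poisson means for returns at locations 1 and 2

# states: (cars at location 1, cars at location 2) at the end of the day
states = [(n1, n2) for n1 in range(MAX_CARS + 1) for n2 in range(MAX_CARS + 1)]
# actions: net number of cars moved from location 1 to location 2 (negative = other way)
actions = list(range(-MAX_MOVE, MAX_MOVE + 1))
```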

32 Value iteration
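A minimal sketch of value iteration for the same tabular representation (P[s, a, s'], R[s, a]); the convergence threshold is an assumption:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new   # greedy policy and (approximate) V*
        V = V_new
```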

33 Policy vs. value iteration
Policy iteration usually needs fewer iterations, but each iteration takes longer
Value iteration converges in polynomial time
Policy iteration converges, but it is not proved that it does so in polynomial time
Which one is faster is task-dependent...

34 If P and R are NOT known: searching for Vπ
R(s): the return starting from s (a random variable)

35 estimation of V(s) by Monte Carlo methods, MC
estimating R(s) by simulation (remember we don’t know P and R) take N episodes starting from s according to 
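A minimal sketch of this Monte Carlo estimate; env.reset_to(s), env.step(a) and policy(s) are hypothetical helpers standing in for whatever simulator and policy are available:

```python
def mc_value_estimate(env, policy, s, n_episodes=1000, gamma=0.9, max_steps=200):
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset_to(s)          # start each episode from the state of interest
        G, discount = 0.0, 1.0
        for _ in range(max_steps):
            state, reward, done = env.step(policy(state))
            G += discount * reward       # accumulate the discounted return of this episode
            discount *= gamma
            if done:
                break
        total += G
    return total / n_episodes            # the average return approximates V_pi(s)
```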

36 Monte Carlo policy evaluation

37 Temporal Differences (TD)
error of estimation (the TD error): δt = rt+1 + γ·V(st+1) - V(st)
advantages:
no need for a model (which DP requires)
no need to wait until the end of an episode (which MC requires)
TD uses an estimate to update an estimate (bootstrapping)

38 TD for policy evaluation
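A minimal sketch of TD(0) policy evaluation, reusing the same hypothetical env/policy interface as the Monte Carlo sketch above; states are assumed to be integer indices:

```python
def td0_evaluation(env, policy, n_states, n_episodes=1000, gamma=0.9, alpha=0.1):
    V = [0.0] * n_states
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # move V(s) toward the one-step TD target
            s = s_next
    return V
```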

39 DP, MC, TD: three methods for policy evaluation (estimation of Vπ)
DP: the model of the environment (P and R) is known; exact computation of the expected value
MC: run simulations using the policy; update the estimates at the end of each episode
TD: a single step is used for updating; it is noisy, so it updates only with a learning rate α

40 TD control: Sarsa
ε-greedy behaviour: greedy action with probability 1-ε, random action with probability ε
Sarsa update: Q(st,at) ← Q(st,at) + α·[ rt+1 + γ·Q(st+1,at+1) - Q(st,at) ]
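A minimal sketch of Sarsa with an ε-greedy behaviour policy, using the same hypothetical env interface as the earlier sketches; states are assumed hashable and actions integer-indexed:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    if random.random() < eps:                    # random action with probability eps
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])   # greedy action otherwise

def sarsa(env, n_actions, n_episodes=1000, gamma=0.9, alpha=0.1, eps=0.1):
    Q = defaultdict(float)                       # Q(s, a) table, default value 0
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])    # update uses (s, a, r, s', a')
            s, a = s_next, a_next
    return Q
```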

41 Exploration strategies
ε-greedy is pretty bad! its exploration is a random walk
bonuses for exploration (a modified reward function): extra reward if a new state is visited, or based on the time of the latest visit, or on the TD error, etc.
simple exploration: high initial values for V(s); the agent will explore every state as it hopes for a high reward
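A minimal sketch of that last trick, optimistic initial values; the state count and the reward bound are assumptions:

```python
# Start every estimate at an upper bound on the achievable discounted return,
# so unvisited states keep looking attractive until they are actually tried.
N_STATES = 4             # e.g. the A, B, C, D example MDP
R_MAX, GAMMA = 100, 0.9
V = [R_MAX / (1 - GAMMA)] * N_STATES    # 1000.0 for every state
```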

42 Regression-based reinforcement learning
if the number of states and/or actions is high, tabular methods become intractable
continuous state and/or action spaces: approximate the value function with a regression model (function approximation)
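One common instance is a linear regression model over state features; a minimal sketch of TD(0) with such an approximator, where the feature map phi (returning a NumPy vector) and the env/policy helpers are assumptions:

```python
import numpy as np

def td0_linear(env, policy, phi, n_features, n_episodes=500, gamma=0.9, alpha=0.01):
    theta = np.zeros(n_features)                 # weights of V_theta(s) = theta . phi(s)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v        # TD error under the current weights
            theta += alpha * delta * phi(s)       # semi-gradient update of the weights
            s = s_next
    return theta
```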

43 TD-Gammon
TD learning with a neural network with a single hidden layer
1,500,000 games between variants (self-play); it achieved the level of a world champion
Backgammon state space: ~10^20, so DP doesn't work!

