Reinforcement learning

Presentation on theme: "Reinforcement learning"— Presentation transcript:

1 Reinforcement learning
02/05/2017 Copyrights: Szepesvári Csaba: Megerősítéses tanulás (2004) Szita István, Lőrincz András: Megerősítéses tanulás (2005) Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)

2 Reinforcement learning

3 Reinforcement learning

4 Reinforcement learning

5 Reinforcement learning
Pavlov: Nomad 200 robot; Nomad 200 simulator (Sridhar Mahadevan, UMass)

6 Reinforcement learning
Control tasks
Planning of multiple actions
Learning from interaction
Objective: maximising reward (i.e. task-specific)
[Figure: trajectory through states s1 ... s9 with actions a1 ... a9 and rewards r1, r4, r5, r9 (values shown: +50, -1, +3).]

7 Supervised vs Reinforcement learning
Both are machine learning.
Supervised: prompt supervision; passive learning (the training dataset is given).
Reinforcement: late, indirect reinforcement; active learning (the system takes actions, which are then reinforced).

8 Reinforcement learning
time: $t = 0, 1, 2, \dots$ (infinite horizon)
states: $s_t \in S$
actions: $a_t \in A$
reward: $r_t \in \mathbb{R}$
policy (strategy): deterministic: $\pi: S \to A$; stochastic: $\pi(s,a)$ is the probability that we choose action $a$ in state $s$

9 process: $s_0, a_0, r_1, s_1, a_1, r_2, \dots$
model of the environment: transition probabilities and rewards
objective: find a policy which maximises the expected value of the total reward

10 Markov assumption → the dynamics of the system can be given by:
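The formula on this slide did not survive extraction. As a hedged reconstruction, assuming the notation of Sutton and Barto (1998), which the deck cites, the dynamics are fully described by the one-step transition probabilities and expected rewards:

```latex
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}, \qquad
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
```

Under the Markov assumption these quantities depend only on the current state and action, not on the earlier history.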

11 Markov Decision Processes (MDPs)
Stochastic transitions
[Figure: two-state MDP (states 1 and 2) with action a1 (r = 0) and action a2 (r = 2).]

12 The exploration-exploitation dilemma: the k-armed bandit
[Figure: an agent facing four bandit arms. Observed rewards: 0, 0, 5, 10, 35 (average 10); 5, 10, -15, -15, -10 (average -5); -20, 0, 50; 100.]
To maximise reward in the long term, we first have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
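One common way to balance exploration and exploitation is the epsilon-greedy rule, which the slide does not name; the sketch below swaps it in purely as an illustration. The arm reward distributions, `epsilon`, and the function name are assumptions chosen for the example.

```python
import random

def epsilon_greedy_bandit(arms, steps=2000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a list of reward samplers; return (estimates, counts)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arms)  # running average reward per arm
    counts = [0] * len(arms)       # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arms))  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best current estimate
            arm = max(range(len(arms)), key=lambda i: estimates[i])
        reward = arms[arm](rng)
        counts[arm] += 1
        # incremental update of the sample mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arms: Gaussian rewards with means 1, 2 and 5 (not from the slides).
arms = [lambda rng, m=m: rng.gauss(m, 1.0) for m in (1.0, 2.0, 5.0)]
estimates, counts = epsilon_greedy_bandit(arms)
```

With a small `epsilon` the agent mostly pulls the arm it currently believes is best, while the occasional random pull keeps the estimates of the other arms from going stale.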

13 Discounting
With an infinite horizon, $\sum_t r_t$ can be infinite!
Solution: discounting. Instead of $\sum_t r_t$ we use $\sum_t \gamma^t r_t$ with $\gamma < 1$, which is always finite for bounded rewards.
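A minimal sketch of the discounted sum, assuming a finite list of rewards stands in for the infinite sequence. With a constant reward of 1 the sum approaches the geometric-series bound $1/(1-\gamma)$, which is why discounting keeps the total finite. The function name and the example values are illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * r_t for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

gamma = 0.9
# A constant reward of 1 over many steps approaches 1 / (1 - gamma).
approx = discounted_return([1.0] * 1000, gamma)
bound = 1.0 / (1.0 - gamma)
```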

14 Markov Decision Process
The environment changes according to $P$ and $R$.
The agent takes an action $a_t$.
We are looking for the optimal policy $\pi^*$ which maximises the expected discounted reward $E\left[\sum_t \gamma^t r_t\right]$.

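15 Long-term reward
The policy $\pi$ of the agent is fixed.
$R_t$ is the total discounted reward (return) after step $t$: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$
[Figure: example trajectory with rewards r1, r4, r5, r9 (values shown: +50, -1, +3).]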

16 Value = expected total reward
The expected value of $R_t$ depends on $\pi$.
$V^\pi(s) = E_\pi[R_t \mid s_t = s]$ is the value function.
Task: find an optimal policy $\pi^*$ which maximises the expected $R_t$ in each state.

17 We optimise (search for $\pi^*$) for the long-term reward instead of immediate (greedy) rewards
[Figure: trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}$.]

18 Bellman equation
Based on the Markov assumption, a recursive formula can be derived for the expected return.
[Figure: backup diagram for state $s$ under $\pi(s)$.]
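The recursive formula itself is missing from the transcript. A hedged reconstruction, assuming the Sutton and Barto (1998) notation in which $P^{a}_{ss'}$ is the probability of moving from $s$ to $s'$ under action $a$, $R^{a}_{ss'}$ is the expected reward of that transition, and $\gamma$ is the discount factor:

```latex
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```

For a fixed policy this is a system of linear equations in the unknowns $V^{\pi}(s)$, one per state.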

19 Preference relation among policies
$\pi_1 \ge \pi_2$ iff $V^{\pi_1}(s) \ge V^{\pi_2}(s)$ for every state $s$ (a partial ordering).
$\pi^*$ is optimal if $\pi^* \ge \pi$ for every policy $\pi$.
An optimal policy exists for every problem.

20 Example MDP: 4 states, 2 actions
[Figure: states A, B, C, D (one marked as the objective); actions 1 and 2; rewards -10 and +100.]
There is a 10% chance of taking the non-selected action.

21 Two example policies
[Figure: states A, B, C, D; actions 1 and 2; rewards -10 and +100.]
$\pi(A,1) = 1$, $\pi(A,2) = 0$
$\pi(B,1) = 1$, $\pi(B,2) = 0$
$\pi(C,1) = 1$, $\pi(C,2) = 0$
$\pi(D,1) = 1$, $\pi(D,2) = 0$

22

23 Solution; solution for $\pi_2$:
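The worked solutions on these slides were images and are not recoverable, and the transition table of the four-state example is not given in the transcript. As a stand-in, here is a minimal sketch of how the value of a fixed policy can be computed by repeatedly applying the Bellman equation (iterative policy evaluation). The two-state MDP, the dictionary layout of `P`, `R` and `policy`, and all names are assumptions made for the example.

```python
def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-10):
    """Iterative policy evaluation.

    P[s][a] is a list of (next_state, probability) pairs,
    R[s][a] is the expected immediate reward,
    policy[s][a] is the probability of choosing action a in state s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # One Bellman backup for state s under the fixed policy.
            v = sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical MDP (not the one on the slides): x moves to y, y loops on itself.
P = {"x": {"go": [("y", 1.0)]}, "y": {"go": [("y", 1.0)]}}
R = {"x": {"go": 1.0}, "y": {"go": 2.0}}
policy = {"x": {"go": 1.0}, "y": {"go": 1.0}}
V = evaluate_policy(P, R, policy)
```

In this toy case the values can be checked by hand: $V(y) = 2/(1-0.9) = 20$ and $V(x) = 1 + 0.9 \cdot 20 = 19$.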

24 A third policy
$\pi_3(A,1) = 0.4$, $\pi_3(A,2) = 0.6$
$\pi_3(B,1) = 1$, $\pi_3(B,2) = 0$
$\pi_3(C,1) = 0$, $\pi_3(C,2) = 1$
$\pi_3(D,1) = 1$, $\pi_3(D,2) = 0$
[Figure: states A, B, C, D; actions 1 and 2; rewards -10 and +100.]

25

26 solution:

27 Comparison of the 3 policies
Values by state (columns: $\pi_1$, $\pi_2$, $\pi_3$):
A: 75.61, 77.78
B: 87.56, 68.05, 87.78
C:
D: 100
$\pi_1 \le \pi_3$ and $\pi_2 \le \pi_3$
$\pi_3$ is optimal.
There can be many optimal policies!
The optimal value function ($V^*$) is unique.

28 Optimal policies and the Bellman equation
$Q$ is the action-value function.
The optimal policies share the same value function, $V^*$.
Greedy policy: $\arg\max_a Q^*(s,a)$
The greedy policy is optimal!

29 Optimal policies and the Bellman equation
The Bellman optimality equation is non-linear, has a unique solution, and solves the long-term planning problem.
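The equation this slide refers to is not in the transcript. A hedged reconstruction, assuming the Sutton and Barto (1998) notation with transition probabilities $P^{a}_{ss'}$, expected rewards $R^{a}_{ss'}$ and discount factor $\gamma$:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]
\qquad
Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right]
```

The $\max$ operator is what makes the system non-linear, in contrast to the linear equations obtained for a fixed policy.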

30 Dynamic programming (DP) for MDPs
Assume $P$ and $R$ are known.
Searching for the optimal $\pi$:
Policy iteration
Value iteration
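Value iteration can be sketched as repeated Bellman optimality backups until the values stop changing, followed by reading off a greedy policy. The code below is a minimal sketch under assumptions: the dictionary layout of `P` and `R`, the deterministic two-state MDP, and the function name are all hypothetical and not taken from the slides.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """P[s][a]: list of (next_state, probability); R[s][a]: expected reward."""
    V = {s: 0.0 for s in P}

    def q(s, a):
        # One-step lookahead value of taking action a in state s.
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            best = max(q(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
    return V, policy

# Hypothetical deterministic MDP with states x, y and actions stay, go.
P = {
    "x": {"stay": [("x", 1.0)], "go": [("y", 1.0)]},
    "y": {"stay": [("y", 1.0)], "go": [("x", 1.0)]},
}
R = {"x": {"stay": 0.0, "go": 1.0}, "y": {"stay": 2.0, "go": 0.0}}
V, policy = value_iteration(P, R)
```

Here staying in `y` earns 2 per step forever, so the best plan from `x` is to go to `y` and stay there.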

31 Policy iteration
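The slide content for policy iteration is not in the transcript. A minimal sketch of the usual scheme, alternating evaluation of the current policy with greedy improvement until the policy stops changing, might look as follows. The data layout, the two-state MDP, and the names are assumptions made for illustration.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-10):
    """P[s][a]: list of (next_state, probability); R[s][a]: expected reward."""
    # Start from an arbitrary deterministic policy: the first listed action.
    policy = {s: next(iter(P[s])) for s in P}
    V = {s: 0.0 for s in P}

    def q(s, a):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            delta = 0.0
            for s in P:
                v = q(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return V, policy

# Hypothetical deterministic MDP with states x, y and actions stay, go.
P = {
    "x": {"stay": [("x", 1.0)], "go": [("y", 1.0)]},
    "y": {"stay": [("y", 1.0)], "go": [("x", 1.0)]},
}
R = {"x": {"stay": 0.0, "go": 1.0}, "y": {"stay": 2.0, "go": 0.0}}
V, policy = policy_iteration(P, R)
```

On this toy problem the initial policy (always `stay`) is improved once, at state `x`, and the second improvement pass finds nothing left to change.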

32 Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
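One building block of this MDP is the expected rental income at a single location. The sketch below, with hypothetical helper names, computes the expected number of cars rented when requests are Poisson with parameter λ and no more cars can be rented than are available; multiplying by the $10 credit gives the expected income. It is an illustration of the reward model only, not a full solution of the problem.

```python
import math

def poisson_pmf(n, lam):
    """Probability that a Poisson(lam) variable equals n."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def expected_rentals(cars_available, lam):
    """Expected value of min(requests, cars_available) for Poisson(lam) requests."""
    total = 0.0
    tail = 1.0  # probability mass not yet accounted for
    for n in range(cars_available):
        p = poisson_pmf(n, lam)
        total += n * p
        tail -= p
    # All remaining mass corresponds to renting out every available car.
    return total + cars_available * tail

# Expected daily credit at the first location (lambda = 3) with 5 cars on hand.
expected_credit = 10 * expected_rentals(5, 3)
```

With plenty of cars the expected rentals approach λ itself, while with few cars the cap dominates and demand is lost.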

33 If P and R are NOT known
Searching for $V^\pi$.
$R(s)$: the return starting from $s$ (a random variable).

34 estimation of V(s) by Monte Carlo methods, MC
estimating R(s) by simulation (remember we don’t know P and R) take N episodes starting from s according to 

35 Monte Carlo policy evaluation
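The slide body is not in the transcript. A minimal sketch of Monte Carlo policy evaluation, assuming access to a simulator that already follows the fixed policy and returns the reward sequence of one episode: average the discounted returns of N episodes started from the state of interest. The simulator `sample_episode`, its termination rule, and the other names are hypothetical.

```python
import random

def mc_evaluate(sample_episode, start_state, n_episodes, gamma, seed=0):
    """Estimate V(start_state) as the average discounted return over n_episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(start_state, rng)
        g = 0.0
        for r in reversed(rewards):  # discounted return of this episode
            g = r + gamma * g
        total += g
    return total / n_episodes

def sample_episode(state, rng):
    """Hypothetical simulator: each step ends the episode with probability 0.5
    (reward 1); otherwise the episode continues with reward 0."""
    rewards = []
    while True:
        if rng.random() < 0.5:
            rewards.append(1.0)
            return rewards
        rewards.append(0.0)

estimate = mc_evaluate(sample_episode, "s", n_episodes=20000, gamma=0.9)
```

For this toy simulator the true value satisfies $V = 0.5 + 0.5 \cdot 0.9\, V$, so $V = 0.5/0.55 \approx 0.909$, and the estimate should land close to it without ever using $P$ or $R$ explicitly.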

