1 Passive Reinforcement Learning
Ruti Glick, Bar-Ilan University

2 Passive Reinforcement Learning
- We assume full observability.
- The agent has a fixed policy π and always executes π(s).
- Goal: learn how good the policy is (similar to policy evaluation).
- But the agent does not have full knowledge of the environment:
  - it does not know the transition model T(s, a, s')
  - it does not know the reward function R(s)

3 Example
- Our familiar 4x3 world, with the policy known in advance.
- The agent executes trials using the policy.
- Each trial starts at (1,1) and experiences a sequence of states until a terminal state is reached.
[Figure: the 4x3 grid world, showing the start state (1,1) and the +1 terminal state]

4 Example (cont.)
Typical trials may be:
- (1,1) -.04 → (1,2) -.04 → (1,3) -.04 → (1,2) -.04 → (1,3) -.04 → (2,3) -.04 → (3,3) -.04 → (4,3) +1
- (1,1) -.04 → (1,2) -.04 → (1,3) -.04 → (2,3) -.04 → (3,3) -.04 → (3,2) -.04 → (3,3) -.04 → (4,3) +1
- (1,1) -.04 → (2,1) -.04 → (3,1) -.04 → (3,2) -.04 → (4,2) -1
(The number after each state is the reward received in that state.)

5 The goal
- The utility U^π(s): the expected sum of discounted rewards obtained when following policy π from s.
- Learning it may also include learning a model of the environment.

6 Algorithms
- Direct utility estimation (DUE)
- Adaptive dynamic programming (ADP)
- Temporal difference learning (TD)

7 Direct utility estimation
- Idea: the utility of a state is the expected total reward from that state onward.
- Each trial supplies one or more samples of this value for each state it visits.
- The "reward to go" of a state is the sum of the rewards from that state until a terminal state is reached.

8 Example
(1,1) -.04 → (1,2) -.04 → (1,3) -.04 → (1,2) -.04 → (1,3) -.04 → (2,3) -.04 → (3,3) -.04 → (4,3) +1
Reward-to-go samples from this trial:
U(1,1) = 0.72
U(1,2) = 0.76, 0.84
U(1,3) = 0.80, 0.88
U(2,3) = 0.92
U(3,3) = 0.96
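
These values can be reproduced mechanically. Below is a minimal Python sketch (illustrative, not part of the original slides) that computes a reward-to-go sample for every visit in the trial above; the rewards of -0.04 per non-terminal state and +1 at (4,3) follow the slides.

# Compute "reward to go" samples for each visit in the first trial.
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
         ((3, 3), -0.04), ((4, 3), +1.0)]

def rewards_to_go(trial):
    """Return (state, sum of rewards from that state onward) for every visit."""
    return [(state, round(sum(r for _, r in trial[i:]), 2))
            for i, (state, _) in enumerate(trial)]

print(rewards_to_go(trial))
# Matches the slide: 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.0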

9 Algorithm
- Run trials following the policy.
- Calculate the observed "reward to go" for every visited state.
- Keep a running average of the utility of each state in a table.
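
A minimal Python sketch of the whole estimator (an assumed every-visit running average; the class and method names are made up for illustration, not the lecture's code):

from collections import defaultdict

class DirectUtilityEstimator:
    """Average the observed reward-to-go samples for every visited state."""
    def __init__(self):
        self.totals = defaultdict(float)   # sum of samples per state
        self.counts = defaultdict(int)     # number of samples per state

    def update(self, trial):
        """trial: list of (state, reward) pairs ending in a terminal state."""
        reward_to_go = 0.0
        for state, reward in reversed(trial):   # walk backwards: O(1) per visit
            reward_to_go += reward
            self.totals[state] += reward_to_go
            self.counts[state] += 1

    def utility(self, state):
        return self.totals[state] / self.counts[state]

After feeding it the first trial from slide 8, utility((1, 2)) returns the average of 0.76 and 0.84, i.e. 0.80.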

10 Properties
- After an infinite number of trials, the sample average converges to the true expectation.
- Advantages: easy to compute; no special actions needed.
- Disadvantage: it is really just an instance of supervised learning (expanded on the next slide).

11 Disadvantage, expanded
- Similarity to supervised learning: each example has an input (the state) and an output (the observed reward to go), so reinforcement learning is reduced to inductive learning.
- What it misses: the dependency between neighboring states. The utility of s equals the reward of s plus the expected utility of its successor states.
- Because it does not use these connections between states, it searches a hypothesis space larger than necessary, and the algorithm converges very slowly.

12 Example
- Second trial: (1,1) -.04 → (1,2) -.04 → (1,3) -.04 → (2,3) -.04 → (3,3) -.04 → (3,2) -.04 → (3,3) -.04 → (4,3) +1
- (3,2) has not been seen before, while (3,3) has been visited before and received a high utility.
- Yet direct utility estimation learns nothing about (3,2) until the end of the sequence, and it searches among far too many possibilities.

13 Adaptive dynamic programming
- Takes advantage of the connections between states: learn the transition model, then solve the Markov decision process.
- While running the known policy:
  - learn T(s, π(s), s') from the observed transitions
  - get R(s) from the observed states
- Then calculate the utilities of the states:
  - plug T(s, π(s), s') and R(s) into the Bellman equations
  - solve the resulting linear equations (or, instead, use a simplified value iteration)

14 Example
- In our three trials, the action Right is executed 3 times in (1,3).
- In 2 of these cases the result is (2,3).
- So T((1,3), Right, (2,3)) is estimated as 2/3.
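
The estimate comes from simple frequency counts. A small illustrative Python sketch (not from the slides; the third outcome (1,2) is read off the first trial, where the agent slipped from (1,3) down to (1,2)):

from collections import defaultdict

N_sa = defaultdict(int)    # how often action a was taken in state s
N_sas = defaultdict(int)   # how often that led to outcome state t

def record(s, a, t):
    N_sa[(s, a)] += 1
    N_sas[(s, a, t)] += 1

def T_hat(s, a, t):
    return N_sas[(s, a, t)] / N_sa[(s, a)]

record((1, 3), 'Right', (1, 2))   # trial 1, first visit: slipped down
record((1, 3), 'Right', (2, 3))   # trial 1, second visit
record((1, 3), 'Right', (2, 3))   # trial 2
print(T_hat((1, 3), 'Right', (2, 3)))   # 2/3 ≈ 0.667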

15 The algorithm

function PASSIVE_ADP_AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          N_sa, a table of frequencies for state-action pairs, initially zero
          N_sas', a table of frequencies for state-action-outcome triples, initially zero
          s, a, the previous state and action, initially null

  if s' is new then do U[s'] ← r'; R[s'] ← r'
  if s is not null then do
      increment N_sa[s, a] and N_sas'[s, a, s']
      for each t such that N_sas'[s, a, t] is nonzero do
          T[s, a, t] ← N_sas'[s, a, t] / N_sa[s, a]
  U ← VALUE_DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a
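
For concreteness, here is a Python sketch of the same agent. It is an assumed implementation, not the lecture's code: VALUE_DETERMINATION is replaced by a few sweeps of simplified value iteration (as slide 13 suggests), and the terminal flag passed with the percept is also an assumption.

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, pi, gamma=1.0):
        self.pi, self.gamma = pi, gamma
        self.U, self.R = {}, {}                 # utility and reward tables
        self.N_sa = defaultdict(int)            # state-action counts
        self.N_sas = defaultdict(int)           # state-action-outcome counts
        self.terminals = set()
        self.s = self.a = None                  # previous state and action

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:
            self.U[s1], self.R[s1] = r1, r1
        if terminal:
            self.terminals.add(s1)
        if self.s is not None:
            self.N_sa[(self.s, self.a)] += 1
            self.N_sas[(self.s, self.a, s1)] += 1
        self._policy_evaluation()
        if terminal:
            self.s = self.a = None
            return None
        self.s, self.a = s1, self.pi[s1]
        return self.a

    def _policy_evaluation(self, sweeps=20):
        # Simplified value iteration for the fixed policy:
        # U(s) = R(s) + gamma * sum_t T(s, pi(s), t) * U(t)
        for _ in range(sweeps):
            for s in self.U:
                if s in self.terminals:
                    continue                    # U stays at R for terminal states
                n = self.N_sa[(s, self.pi[s])]
                if n == 0:
                    continue                    # no observed data for this state yet
                self.U[s] = self.R[s] + self.gamma * sum(
                    (self.N_sas[(s, self.pi[s], t)] / n) * self.U[t]
                    for t in self.U)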

16 Properties
- Learning the model can be seen as supervised learning: the input is a state-action pair and the output is the resulting state.
- Learning the model is easy here because the environment is fully observable.
- The algorithm does as well as possible, so it provides a standard against which to measure other reinforcement learning algorithms.
- Intractable for large state spaces: for backgammon it would mean solving 10^50 equations with 10^50 unknowns.
- Disadvantage: a lot of work at each iteration.

17 Performance in the 4x3 world

18 Temporal difference learning
- The best of both worlds: it approximates the constraint equations without solving them for all possible states.
- Method:
  - run according to the policy π
  - use the observed transitions to adjust the utilities so that they agree with the constraint equations

19 Example
- As a result of the first trial, U^π(1,3) = 0.84 and U^π(2,3) = 0.92.
- If the transition (1,3) → (2,3) always occurred, we would expect U^π(1,3) = -0.04 + U^π(2,3) = 0.88.
- So the current estimate of 0.84 is a bit low and should be increased.

20 In practice
- When a transition occurs from s to s', apply the update:
  U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))
- α is the learning rate parameter.
- This is called temporal difference learning because the update rule uses the difference in utilities between successive states.
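
Plugging the numbers from the previous slide into the update (γ = 1 as in the 4x3 world; the learning rate α = 0.1 is an arbitrary illustrative choice, not from the slides):

alpha, gamma = 0.1, 1.0
U_13, U_23, R = 0.84, 0.92, -0.04          # estimates after the first trial

U_13 = U_13 + alpha * (R + gamma * U_23 - U_13)
print(U_13)   # ~0.844: nudged toward the target -0.04 + 0.92 = 0.88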

21 The algorithm

function PASSIVE_TD_AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          U, a table of utilities, initially empty
          N_s, a table of frequencies for states, initially zero
          s, a, r, the previous state, action, and reward, initially null

  if s' is new then do U[s'] ← r'
  if s is not null then do
      increment N_s[s]
      U[s] ← U[s] + α(N_s[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
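
A Python sketch of the same agent (assumed implementation; the terminal flag and the default decaying schedule α(n) = 1/n are illustrative choices):

from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, pi, gamma=1.0, alpha=lambda n: 1.0 / n):
        self.pi, self.gamma, self.alpha = pi, gamma, alpha
        self.U = {}                       # utility estimates
        self.N_s = defaultdict(int)       # visit counts per state
        self.s = self.r = None            # previous state and reward

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:
            self.U[s1] = r1
        if self.s is not None:
            self.N_s[self.s] += 1
            a = self.alpha(self.N_s[self.s])
            # Move U[s] a step toward the one-step target r + gamma * U[s'].
            self.U[self.s] += a * (self.r + self.gamma * self.U[s1] - self.U[self.s])
        if terminal:
            self.s = self.r = None
            return None
        self.s, self.r = s1, r1
        return self.pi[s1]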

22 Properties
- The update involves only the observed successor s'; it does not take all possible successors into account.
- This makes it efficient over a large number of transitions.
- TD does not learn the model: the environment supplies the connections between neighboring states in the form of observed transitions.
- The average value of U^π(s) converges to the correct value.

23 Quality
- The average value of U^π(s) converges to the correct value.
- If α is a function of the number of times a state has been visited, and it decreases as that count increases, then U(s) itself converges to the correct value.
- We require: Σ_n α(n) = ∞ and Σ_n α(n)² < ∞.
- The function α(n) = 1/n satisfies these conditions.
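
A quick illustrative check (not from the slides) of why α(n) = 1/n behaves well: with that schedule, the TD-style update makes the estimate exactly the running average of the one-step targets it has seen. The target values below are made up.

targets = [0.88, 0.92, 0.90, 0.86]      # made-up one-step targets for one state
U, n = 0.0, 0
for t in targets:
    n += 1
    U += (1.0 / n) * (t - U)            # the TD update with alpha(n) = 1/n
print(U, sum(targets) / len(targets))   # both ~0.89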

24 Performance in the 4x3 world

25 TD vs. ADP
TD:
- Does not learn as fast as ADP and shows higher variability.
- Is simpler than ADP and requires much less computation per observation.
- Does not need a model to perform its updates.
- Makes each state's estimate agree with its observed successor only (instead of with all possible successors, as ADP does).
- TD can be viewed as a crude, yet efficient, first approximation to ADP.

26 TD vs. ADP (the two update steps side by side)

PASSIVE_TD_AGENT:
  if s' is new then do U[s'] ← r'
  if s is not null then do
      increment N_s[s]
      U[s] ← U[s] + α(N_s[s]) (r + γ U[s'] − U[s])
  if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a

PASSIVE_ADP_AGENT:
  if s' is new then do U[s'] ← r'; R[s'] ← r'
  if s is not null then do
      increment N_sa[s, a] and N_sas'[s, a, s']
      for each t such that N_sas'[s, a, t] is nonzero do
          T[s, a, t] ← N_sas'[s, a, t] / N_sa[s, a]
  U ← VALUE_DETERMINATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a

