Presentation transcript:

1 Possible actions: up, down, right, left. Rewards: –0.04 in each non-terminal state. The environment is observable (i.e., the agent knows where it is). MDP = "Markov Decision Process". [Figure: 4×3 grid world, columns 1–4 by rows 1–3, with terminal states +1 and –1 and start state s0.] Actions are stochastic: e.g., in the book example, the probability of moving in the desired direction is 0.8, and the probability of moving at a right angle to the desired direction is 0.2 (0.1 to each side).
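
As a concrete illustration, here is a minimal Python sketch of how this grid world could be encoded, assuming the standard textbook layout (a wall at column 2, row 2; terminals +1 at (4,3) and –1 at (4,2); start state s0 at (1,1)). The names reward, move, and transition are illustrative, not from the slides.

# Sketch of the 4x3 grid world MDP (assumed standard textbook layout).
# Coordinates are (column, row); the square at (2, 2) is assumed to be a wall.
WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'right': (1, 0), 'left': (-1, 0)}
S0 = (1, 1)

def reward(s):
    """R(s): -0.04 in non-terminal states, +1/-1 in the two terminal states."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of action a in state s; moving off-grid or into the wall stays put."""
    c, r = s
    dc, dr = ACTIONS[a]
    s2 = (c + dc, r + dr)
    return s2 if s2 in STATES else s

def transition(s, a):
    """T(s, a, .): 0.8 for the desired direction, 0.1 for each right-angle direction."""
    right_angles = {'up': ('left', 'right'), 'down': ('left', 'right'),
                    'left': ('up', 'down'), 'right': ('up', 'down')}
    probs = {}
    for p, a2 in [(0.8, a), (0.1, right_angles[a][0]), (0.1, right_angles[a][1])]:
        s2 = move(s, a2)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs  # dict mapping successor state -> probability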

2 Utility of a state sequence = discounted sum of rewards. [Figure: 4×3 grid world with start state s0.]
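
In symbols, the standard definition (with discount factor 0 < γ ≤ 1):

    U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t)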

3 [Figure: 4×3 grid world with start state s0.] Policy: a function π that maps states to actions, π: S → A. Optimal policy π*: the policy with the highest expected utility.

4 Utility of state s given policy π, where s_t is the state reached after starting in s and executing π for t steps. [Figure: 4×3 grid world with start state s0.]
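
Written out in the standard expectation form (the expectation is over the stochastic action outcomes, with s_0 = s):

    U^{\pi}(s) = E\!\left[\, \sum_{t=0}^{\infty} \gamma^t R(s_t) \right]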

5 Define: U(s) ≡ U^{π*}(s), the utility of a state under the optimal policy π*. [Figure: 4×3 grid world with start state s0.]

6 Suppose that the agent, in state s at time t and following π*, moves to state s' at time t+1. We can write U(s) in terms of U(s'); this is Bellman's equation.

7 Suppose that the agent, in state s at time t and following π*, moves to state s' at time t+1. We can write U(s) in terms of U(s'); this is Bellman's equation. Bellman's equation yields a set of simultaneous equations that can be solved (given certain assumptions) to find the utilities.
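
In the standard form for this setting (with transition model T(s, a, s') and discount factor γ), Bellman's equation reads:

    U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U(s')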

8 State utilities for the 4×3 grid world (row 3 at top; the square at column 2, row 2 is a wall):
row 3:  0.812   0.868   0.918   +1
row 2:  0.762   (wall)  0.660   –1
row 1:  0.705   0.655   0.611   0.388

9 How to learn an optimal policy? Value iteration: –Calculate the utility of each state, then use the state utilities to select the optimal action in each state.

10 How to learn an optimal policy? Value iteration: –Calculate the utility of each state, then use the state utilities to select the optimal action in each state. But: this requires knowing R(s) and T(s, a, s'). In most problems, the agent doesn't have this knowledge.
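
For concreteness, here is a minimal Python sketch of value iteration under these assumptions, i.e. with R(s) and T(s, a, s') known. It is written against the reward and transition functions sketched after slide 1; all names are illustrative.

def value_iteration(states, actions, reward, transition, terminals,
                    gamma=1.0, eps=1e-6):
    """Repeat the Bellman update until the utilities converge; return U as a dict."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            if s in terminals:
                U_new[s] = reward(s)
                continue
            # Bellman update: R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
            best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                       for a in actions)
            U_new[s] = reward(s) + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

def greedy_policy(states, actions, transition, terminals, U):
    """Pick, in each non-terminal state, the action with the highest expected utility."""
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()))
            for s in states if s not in terminals}

Calling value_iteration(STATES, ACTIONS, reward, transition, TERMINALS) on the grid-world sketch with γ = 1 should give utilities close to the values on slide 8 (e.g., roughly 0.705 at s0).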

11 Reinforcement Learning: The agent has no teacher (contrast with NN) and no prior knowledge of the reward function or of the state-transition function. "Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces 'you lose'. This is reinforcement learning in a nutshell." (Textbook) Question: How best to explore "on-line" (while receiving rewards and punishments)? Analogous to the multi-armed bandit problem I mentioned earlier.

12 Q-learning: Don't learn utilities! Instead, learn a "value" function Q: S × A → ℝ, where Q(s, a) is the estimated utility of taking action a from state s (so U(s) = max_a Q(s, a)). This is a "model-free" method. If we knew Q(s, a) for each state/action pair, we could simply choose the action that maximizes Q.
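
One common way to write the relationship between Q and U, together with the Bellman-style equation that Q satisfies (consistent with the equation after slide 7):

    U(s) = \max_{a} Q(s, a), \qquad Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')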

13 How to learn Q (simplified from Figure 21.8). Assume T: S × A → S is a deterministic state-transition function. Assume α = γ = 1.
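
Under these assumptions the general tabular Q-learning update, Q(s, a) ← Q(s, a) + α [ R(s) + γ max_{a'} Q(s', a') - Q(s, a) ], collapses to the simple form used on the next slide:

    Q(s, a) \leftarrow R(s) + \max_{a'} Q(s', a')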

14 How to learn Q (simplified from Figure 21.8):

Q-Learn() {
  Q-matrix = 0;                        // all zeros
  t = 0;
  s = s0;
  while s is not a terminal state {
    choose action a;                   // many different ways to do this
    s' = T(s, a);
    Q(s, a) = R(s) + max_a' Q(s', a');
    s = s';
  }
}
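
As a runnable companion to this pseudocode, here is a minimal Python sketch under slide 13's assumptions (deterministic T, α = γ = 1), using purely random exploration as one possible way to choose actions. Backing up a terminal successor's reward directly is an assumption the compressed pseudocode leaves implicit; all names are illustrative.

import random

def q_learn(states, actions, R, T, terminals, s0, episodes=1000):
    """Tabular Q-learning with alpha = gamma = 1 and random exploration."""
    Q = {(s, a): 0.0 for s in states for a in actions}        # Q-matrix = 0
    for _ in range(episodes):
        s = s0
        while s not in terminals:
            a = random.choice(list(actions))                  # choose action a (one of many strategies)
            s2 = T(s, a)                                      # s' = T(s, a)
            if s2 in terminals:
                best_next = R(s2)                             # assumption: back up the terminal reward
            else:
                best_next = max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = R(s) + best_next                      # Q(s, a) = R(s) + max_a' Q(s', a')
            s = s2
    return Q

With the grid-world sketch above, T could be the deterministic move function (ignoring the stochastic action outcomes, as slide 13 assumes), and the learned greedy policy in each state is then the action maximizing Q(s, a).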

15 Pathfinder demo

16 How to do HW problem 4. [Figure: 4×3 grid world with start state s0.]

