Presentation transcript:

1 Possible actions: up, down, right, left. Rewards: –0.04 in each non-terminal state. The environment is observable (i.e., the agent knows where it is). MDP = "Markov Decision Process". [Figure: 4×3 grid world, columns 1–4 by rows 1–3, with terminal states +1 and –1 and start state s0.] Actions are stochastic: e.g., in the book example, the probability of moving in the desired direction is 0.8, and the probability of moving at a right angle to the desired direction is 0.2 (0.1 to each side).
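
As a concrete illustration, here is a minimal Python sketch of how this grid world could be encoded, assuming the standard textbook layout (a wall at column 2, row 2; terminals +1 at (4,3) and –1 at (4,2); start state s0 at (1,1)). The names reward, move, and transition are illustrative, not from the slides.

# Sketch of the 4x3 grid world MDP (assumed standard textbook layout).
# Coordinates are (column, row); the square at (2, 2) is assumed to be a wall.
WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'right': (1, 0), 'left': (-1, 0)}
S0 = (1, 1)

def reward(s):
    """R(s): -0.04 in non-terminal states, +1/-1 in the two terminal states."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of action a in state s; moving off-grid or into the wall stays put."""
    c, r = s
    dc, dr = ACTIONS[a]
    s2 = (c + dc, r + dr)
    return s2 if s2 in STATES else s

def transition(s, a):
    """T(s, a, .): 0.8 for the desired direction, 0.1 for each right-angle direction."""
    right_angles = {'up': ('left', 'right'), 'down': ('left', 'right'),
                    'left': ('up', 'down'), 'right': ('up', 'down')}
    probs = {}
    for p, a2 in [(0.8, a), (0.1, right_angles[a][0]), (0.1, right_angles[a][1])]:
        s2 = move(s, a2)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs  # dict mapping successor state -> probability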

2 Utility of a state sequence = discounted sum of rewards. [Figure: 4×3 grid world with start state s0.]
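
In symbols, the standard definition (with discount factor 0 < γ ≤ 1):

    U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t)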

3 [Figure: 4×3 grid world with start state s0.] Policy: a function π that maps states to actions, π: S → A. Optimal policy π*: the policy with the highest expected utility.

4 Utility of state s given policy π, where s_t is the state reached after starting in s and executing π for t steps. [Figure: 4×3 grid world with start state s0.]
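
Written out in the standard expectation form (the expectation is over the stochastic action outcomes, with s_0 = s):

    U^{\pi}(s) = E\!\left[\, \sum_{t=0}^{\infty} \gamma^t R(s_t) \right]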

5 Define: U(s) ≡ U^{π*}(s), the utility of a state under the optimal policy π*. [Figure: 4×3 grid world with start state s0.]

6 Suppose that the agent, in state s at time t and following π*, moves to state s' at time t+1. We can write U(s) in terms of U(s'); this is Bellman's equation.

7 Suppose that the agent, in state s at time t and following π*, moves to state s' at time t+1. We can write U(s) in terms of U(s'); this is Bellman's equation. Bellman's equation yields a set of simultaneous equations that can be solved (given certain assumptions) to find the utilities.
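
In the standard form for this setting (with transition model T(s, a, s') and discount factor γ), Bellman's equation reads:

    U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U(s')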

8 State utilities for the 4×3 grid world (row 3 at top; the square at column 2, row 2 is a wall):
row 3:  0.812   0.868   0.918   +1
row 2:  0.762   (wall)  0.660   –1
row 1:  0.705   0.655   0.611   0.388

9 How to learn an optimal policy? Value iteration: –Calculate the utility of each state, then use the state utilities to select the optimal action in each state.

10 How to learn an optimal policy? Value iteration: –Calculate the utility of each state, then use the state utilities to select the optimal action in each state. But: this requires knowing R(s) and T(s, a, s'). In most problems, the agent doesn't have this knowledge.
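
For concreteness, here is a minimal Python sketch of value iteration under these assumptions, i.e. with R(s) and T(s, a, s') known. It is written against the reward and transition functions sketched after slide 1; all names are illustrative.

def value_iteration(states, actions, reward, transition, terminals,
                    gamma=1.0, eps=1e-6):
    """Repeat the Bellman update until the utilities converge; return U as a dict."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            if s in terminals:
                U_new[s] = reward(s)
                continue
            # Bellman update: R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
            best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                       for a in actions)
            U_new[s] = reward(s) + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

def greedy_policy(states, actions, transition, terminals, U):
    """Pick, in each non-terminal state, the action with the highest expected utility."""
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()))
            for s in states if s not in terminals}

Calling value_iteration(STATES, ACTIONS, reward, transition, TERMINALS) on the grid-world sketch with γ = 1 should give utilities close to the values on slide 8 (e.g., roughly 0.705 at s0).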

11 Reinforcement Learning: The agent has no teacher (contrast with NN) and no prior knowledge of the reward function or of the state-transition function. "Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces 'you lose'. This is reinforcement learning in a nutshell." (Textbook) Question: How best to explore "on-line" (while receiving rewards and punishments)? Analogous to the multi-armed bandit problem I mentioned earlier.

12 Q-learning: Don't learn utilities! Instead, learn a "value" function Q: S × A → ℝ, where Q(s, a) is the estimated utility of taking action a from state s (so U(s) = max_a Q(s, a)). This is a "model-free" method. If we knew Q(s, a) for each state/action pair, we could simply choose the action that maximizes Q.
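
One common way to write the relationship between Q and U, together with the Bellman-style equation that Q satisfies (consistent with the equation after slide 7):

    U(s) = \max_{a} Q(s, a), \qquad Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')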

13 How to learn Q (simplified from Figure 21.8). Assume T: S × A → S is a deterministic state-transition function. Assume α = γ = 1.
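
Under these assumptions the general tabular Q-learning update, Q(s, a) ← Q(s, a) + α [ R(s) + γ max_{a'} Q(s', a') - Q(s, a) ], collapses to the simple form used on the next slide:

    Q(s, a) \leftarrow R(s) + \max_{a'} Q(s', a')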

14 How to learn Q (simplified from Figure 21.8):

Q-Learn() {
  Q-matrix = 0;                        // all zeros
  t = 0;
  s = s0;
  while s is not a terminal state {
    choose action a;                   // many different ways to do this
    s' = T(s, a);
    Q(s, a) = R(s) + max_a' Q(s', a');
    s = s';
  }
}
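
As a runnable companion to this pseudocode, here is a minimal Python sketch under slide 13's assumptions (deterministic T, α = γ = 1), using purely random exploration as one possible way to choose actions. Backing up a terminal successor's reward directly is an assumption the compressed pseudocode leaves implicit; all names are illustrative.

import random

def q_learn(states, actions, R, T, terminals, s0, episodes=1000):
    """Tabular Q-learning with alpha = gamma = 1 and random exploration."""
    Q = {(s, a): 0.0 for s in states for a in actions}        # Q-matrix = 0
    for _ in range(episodes):
        s = s0
        while s not in terminals:
            a = random.choice(list(actions))                  # choose action a (one of many strategies)
            s2 = T(s, a)                                      # s' = T(s, a)
            if s2 in terminals:
                best_next = R(s2)                             # assumption: back up the terminal reward
            else:
                best_next = max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = R(s) + best_next                      # Q(s, a) = R(s) + max_a' Q(s', a')
            s = s2
    return Q

With the grid-world sketch above, T could be the deterministic move function (ignoring the stochastic action outcomes, as slide 13 assumes), and the learned greedy policy in each state is then the action maximizing Q(s, a).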

15 Pathfinder demo

16 How to do HW problem 4. [Figure: 4×3 grid world with start state s0.]

