Markov Decision Processes


1 Markov Decision Processes
CSE 6363 – Machine Learning. Vassilis Athitsos, Computer Science and Engineering Department, University of Texas at Arlington.

2 A Sequential Decision Problem
This example is taken from S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach", third edition (2009), Prentice Hall. We have an environment that is a 3×4 grid, and an agent that starts at position (1,1). There are (at most) four possible actions: go left, right, up, or down. Position (2,2) cannot be reached. Positions are denoted as (row, col).

3 A Sequential Decision Problem
Positions (2,4) and (3,4) are terminal. A mission is a sequence of actions that starts with the agent at the START position and ends with the agent at a terminal position. If the agent reaches position (3,4), the reward is +1. If the agent reaches position (2,4), the reward is −1 (so it is actually a penalty). The agent wants to maximize the total rewards gained during its mission.

4 A Deterministic Case
Under some conditions, the solution for reward maximization is easy to find. Suppose that each action always succeeds: the "go left" action takes you one position to the left, the "go right" action takes you one position to the right, the "go up" action takes you one position upwards, and the "go down" action takes you one position downwards. This situation is called deterministic. A deterministic environment is an environment where the result of any action is known in advance. A non-deterministic environment is an environment where the result of an action is not known in advance.

5 A Deterministic Case
Under some conditions, the solution for reward maximization is easy to find. Suppose that each action always succeeds: the "go left" action takes you one position to the left, the "go right" action takes you one position to the right, the "go up" action takes you one position upwards, and the "go down" action takes you one position downwards. Suppose that any non-terminal state yields a reward of −0.04. Then, what is the optimal sequence of actions?

6 A Deterministic Case
Suppose, as before, that each action always succeeds, and that any non-terminal state yields a reward of −0.04. Then, what is the optimal sequence of actions? Up, up, right, right, right gets the agent from START to position (3,4). Total rewards: 1 − 5×0.04 = 0.8 (five non-terminal states, including START).

7 A Deterministic Case
Suppose, as before, that each action always succeeds, and that any non-terminal state yields a reward of −0.04. The optimal sequence is not unique: right, right, up, up, right is also optimal. Total rewards: 1 − 5×0.04 = 0.8 (five non-terminal states, including START).

8 A Deterministic Case
Suppose, as before, that each action always succeeds, and that any non-terminal state yields a reward of −0.04. The optimal sequence can be found using well-known algorithms such as breadth-first search.

9 A Non-Deterministic Case
Under some conditions, life gets more complicated. Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. For example, the "go up" action: has a probability of 0.8 to take the agent one position upwards; has a probability of 0.1 to take the agent one position to the left; has a probability of 0.1 to take the agent one position to the right.

10 A Non-Deterministic Case
Under some conditions, life gets more complicated. Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving. For example: the agent is at position (1,1) and executes the "go up" action. Due to bad luck, the action moves the agent to the left. The agent hits the wall, and remains at position (1,1).

11 Sequential Decision Problems
Under some conditions, life gets more complicated. Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving. In that case, choosing the best action to take at each position is a more complicated problem. A sequential decision problem consists of choosing the best sequence of actions, so as to maximize the total rewards.

12 Markov Decision Processes (MDPs)
A Markov Decision Process (MDP) is a sequential decision problem, with some additional assumptions. Assumption 1: Markovian Transition Model. The probability p(s′ | s, a, H) is the probability of ending up in state s′, given: the previous state s, where the agent took the last action; the last action a; and the history H of all prior actions and states since the start of the mission. In a Markovian transition model, p(s′ | s, a, H) = p(s′ | s, a): given the last state and action, the history does not matter.

13 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "left") = ???
p((2,1) | (1,1), "left") = ???
p((1,2) | (1,1), "left") = ???

14 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "left") = 0.9 (0.8 chance of going left and hitting the wall, plus 0.1 chance of going down and hitting the wall).
p((2,1) | (1,1), "left") = 0.1
p((1,2) | (1,1), "left") = 0 (if you try to go left, you never end up going right).

15 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "right") = ???
p((2,1) | (1,1), "right") = ???
p((1,2) | (1,1), "right") = ???

16 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "right") = 0.1 (0.1 chance of going down and hitting the wall).
p((2,1) | (1,1), "right") = 0.1
p((1,2) | (1,1), "right") = 0.8

17 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "up") = ???
p((2,1) | (1,1), "up") = ???
p((1,2) | (1,1), "up") = ???

18 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "up") = 0.1
p((2,1) | (1,1), "up") = 0.8
p((1,2) | (1,1), "up") = 0.1

19 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "down") = ???
p((2,1) | (1,1), "down") = ???
p((1,2) | (1,1), "down") = ???

20 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving.
p((1,1) | (1,1), "down") = 0.9 (0.8 chance of going down and hitting the wall, plus 0.1 chance of going left and hitting the wall).
p((2,1) | (1,1), "down") = 0 (if you try to go down, you never end up going up).
p((1,2) | (1,1), "down") = 0.1

21 A Transition Model Example
Suppose that each action succeeds with probability 0.8, and has a 0.2 probability of moving in a direction that differs by 90 degrees from the intended direction. Suppose that bumping into the wall leads to not moving. In a similar way, we can define all probabilities p(s′ | s, a) for: every one of the 11 legal values for state s; every one of the 2 to 4 legal values for the neighbor s′; and every one of the 4 legal values for action a.
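To make the transition model concrete, here is a minimal Python sketch (my own code, not from the slides) that enumerates p(s′ | s, a) for this grid world. The grid layout, the 0.8/0.1/0.1 noise model, and the "stay in place when bumping into a wall or the blocked square" rule follow the slides; all names are my own.

# Transition model for the 3x4 grid world: rows 1..3, columns 1..4, (2,2) blocked.
BLOCKED = {(2, 2)}
STATES = [(r, c) for r in range(1, 4) for c in range(1, 5) if (r, c) not in BLOCKED]
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
# The two directions perpendicular to each action (each taken with probability 0.1).
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def move(s, d):
    # Deterministic result of one step from s in direction d; walls and the blocked
    # square leave the agent where it is.
    r, c = s[0] + ACTIONS[d][0], s[1] + ACTIONS[d][1]
    return (r, c) if (r, c) in STATES else s

def p(s, a):
    # Return a dictionary {s_prime: p(s_prime | s, a)} for a non-terminal state s.
    probs = {}
    for d, prob in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        probs[move(s, d)] = probs.get(move(s, d), 0.0) + prob
    return probs

print(p((1, 1), "left"))   # 0.9 for staying at (1,1), 0.1 for (2,1), as on slide 14
print(p((1, 1), "right"))  # 0.8 for (1,2), 0.1 for (2,1), 0.1 for (1,1), as on slide 16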

22 Markov Decision Processes (MDPs)
Assumption 2: Discounted Additive Rewards. The utility U_h of a state sequence s_0, s_1, …, s_T is:
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
In the above equation: R(s) is the reward function, mapping each state s to a reward; γ is called the discount factor, 0 ≤ γ ≤ 1.

23 Discounted Additive Rewards
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
Suppose that γ = 1. Then:
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} R(s_t)
Therefore, when γ = 1, the utility function is additive: it is simply the sum of the rewards of all states in the sequence.

24 Discounted Additive Rewards
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
When γ < 1, the above formula indicates that the agent prefers immediate rewards over future rewards. The agent is at state s_0, considering what to do next. Sequence s_1, …, s_T is a possible sequence of future states. As t increases, γ^t decreases exponentially towards 0. Thus, rewards coming far into the future (large t) are heavily discounted, with a factor γ^t that quickly gets close to 0.

25 Discounted Additive Rewards
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
This type of utility is called discounted additive rewards, since: the utility is additive, i.e., a (weighted) summation of rewards attained at individual states; and the reward at each state s_t is discounted by a factor γ^t. When γ = 1, we simply have additive rewards.
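As a small illustration (not from the original slides), discounted additive rewards can be evaluated with a one-line sum; the helper name below is my own.

def sequence_utility(rewards, gamma):
    # U_h(s_0,...,s_T) = sum over t of gamma^t * R(s_t), given the list of rewards R(s_t).
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(sequence_utility([-0.04, -0.04, 1.0], gamma=1.0))  # additive: about 0.92
print(sequence_utility([-0.04, -0.04, 1.0], gamma=0.9))  # discounted: about 0.73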

26 Discounted Additive Rewards
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
When does it make sense to use γ < 1, so that future rewards get discounted? Discounted rewards are (unfortunately?) good models of human behavior. Slacking now is often preferable, versus acing the exam later. The reward for slacking is relatively low but immediate. The reward for acing the exam is higher, but more remote.

27 Discounted Additive Rewards
U_h(s_0, s_1, …, s_T) = Σ_{t=0}^{T} γ^t R(s_t)
When does it make sense to use γ < 1, so that future rewards get discounted? Discounted rewards are also a way to get an agent to focus on the near term. We often want our intelligent agents to achieve results within a specific time window. In that case, discounted rewards de-emphasize the contribution of states reached beyond that time window.

28 The MDP Problem
When we have an MDP, the problem that we typically want to solve is to find an optimal policy. A policy π is a function mapping states to actions: when the agent is at state s, the policy tells the agent to perform action π(s). An optimal policy π* is a policy that maximizes the expected utility. The expected utility of a policy π is the average utility attained per mission, when the agent carries out an infinite number of missions following that policy π.

29 Policy Examples
A policy π is a function mapping states to actions. An optimal policy π* is a policy that maximizes the expected utility. The figure shows an example policy, which happens to be optimal when: R(s) = −0.04 for non-terminal states s, and γ = 1.

30 Policy Examples
Top figure: the optimal policy for R(s) = −0.04 for non-terminal states s, with γ = 1. Bottom figure: the optimal policy for R(s) = −0.02 for non-terminal states s. Changing R(s) from −0.04 to −0.02 makes longer sequences less costly.

31 Policy Examples
Top figure: the optimal policy for R(s) = −0.04 for non-terminal s, with γ = 1. Bottom figure: the optimal policy for R(s) = −0.1 for non-terminal s. Changing R(s) from −0.04 to −0.1 makes longer sequences more costly. It is worth taking risks to reach the +1 state as fast as possible.

32 Utility of a State
In order to figure out how to compute the optimal policy π*, we need to study some of its properties. We define the utility U(s_0) of a state s_0 as the expected value E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences s_0, s_1, …, s_T that can happen if the agent follows policy π*. Obviously, we assume that the agent knows π*, in order to follow that policy. If the agent follows a specific policy π*, why are there multiple possible sequences of future states? π*(s) tells us the action the agent will take at any state s, but, remember, the result of the action is non-deterministic. The probability that action π*(s) will lead to state s′ is modeled by the state transition function p(s′ | s, π*(s)).

33 A Note on Notation
Note that we have defined three different utility-related functions. R(s_0) is the immediate reward obtained when the agent reaches state s_0. U_h(s_0, s_1, …, s_T) is the (possibly discounted) sum of rewards of states s_0, s_1, …, s_T; thus, U_h(s_0) = R(s_0), since U_h(s_0) = Σ_{t=0}^{0} γ^t R(s_t) = R(s_0). U(s_0) is the expected value E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences s_0, s_1, …, s_T that can happen if the agent is at state s_0 and the agent follows the optimal policy π*.

34 Utility of a Sequence
Suppose that any non-terminal state yields a reward of −0.04, and that γ = 0.9. Let's consider a state sequence S defined as: S = (1,1), (1,2), (1,3), (2,3), (3,3), (3,2), (3,3), (3,4). How do we compute U_h(S)?
U_h(S) = Σ_{t=0}^{T} γ^t R(s_t)
= R(1,1) + 0.9 R(1,2) + 0.81 R(1,3) + 0.73 R(2,3) + 0.66 R(3,3) + 0.59 R(3,2) + 0.53 R(3,3) + 0.48 R(3,4)
= 1×(−0.04) + 0.9×(−0.04) + 0.81×(−0.04) + 0.73×(−0.04) + 0.66×(−0.04) + 0.59×(−0.04) + 0.53×(−0.04) + 0.48×1

35 Utility of a Sequence
Suppose that any non-terminal state yields a reward of −0.04, and that γ = 0.9. Let's consider the state sequence S = (1,1), (1,2), (1,3), (2,3), (3,3), (3,2), (3,3), (3,4). How do we compute U_h(S)?
U_h(S) = Σ_{t=0}^{T} γ^t R(s_t) ≈ 0.27
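A quick way to check this number (my own sketch, not part of the slides) is to evaluate the discounted sum directly:

# Rewards along the sequence S: seven non-terminal states at -0.04, then the +1 terminal.
gamma = 0.9
rewards = [-0.04] * 7 + [1.0]
u_h = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(round(u_h, 2))  # 0.27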

36 Utility of a State
(Figure: a smaller toy grid, with START at position (1,1), a +1 terminal at (2,1) directly above it, and a −1 terminal at (1,2) directly to its right.)
What is the utility of state (2,1) in this toy example? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. If we start with s_0 = (2,1), what are all possible sequences s_0, s_1, …, s_T? Since (2,1) is a terminal state, the only possible sequence is (2,1). Thus, U(2,1) = U_h((2,1)) = ???

37 Utility of a State
What is the utility of state (2,1) in this toy example? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. If we start with s_0 = (2,1), what are all possible sequences s_0, s_1, …, s_T? Since (2,1) is a terminal state, the only possible sequence is (2,1). Thus, U(2,1) = U_h((2,1)) = 1.

38 Utility of a State
What is the utility of state (1,2) in this toy example? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. Since (1,2) is a terminal state, the only possible sequence is (1,2). Thus, U(1,2) = U_h((1,2)) = −1.

39 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. (1,1) is not a terminal state. How many possible sequences are there?

40 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. (1,1) is not a terminal state. There are infinitely many possible sequences. Assuming γ = 0.9:
(1,1), (2,1), with utility U_h = −0.04 + 0.9×1 = 0.86
(1,1), (1,2), with utility U_h = −0.04 + 0.9×(−1) = −0.94
(1,1), (1,1), (2,1), with U_h = −0.04 + 0.9×(−0.04) + 0.81×1 ≈ 0.73
(1,1), (1,1), (1,2), with U_h = −0.04 + 0.9×(−0.04) + 0.81×(−1) ≈ −0.89

41 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)]. There are infinitely many possible sequences. Assuming γ = 0.9:
(1,1), (2,1), with utility U_h = −0.04 + 0.9×1 = 0.86
(1,1), (1,2), with utility U_h = −0.04 + 0.9×(−1) = −0.94
(1,1), (1,1), (2,1), with U_h = −0.04 + 0.9×(−0.04) + 0.81×1 ≈ 0.73
(1,1), (1,1), (1,2), with U_h = −0.04 + 0.9×(−0.04) + 0.81×(−1) ≈ −0.89
(1,1), (1,1), (1,1), (2,1)
(1,1), (1,1), (1,1), (1,2)
(1,1), (1,1), (1,1), (1,1), (2,1)
(1,1), (1,1), (1,1), (1,1), (1,2)
…

42 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. There are infinitely many possible sequences. Assuming γ = 0.9:
(1,1), (2,1), with utility U_h = −0.04 + 0.9×1 = 0.86
(1,1), (1,2), with utility U_h = −0.04 + 0.9×(−1) = −0.94
(1,1), (1,1), (2,1), with U_h = −0.04 + 0.9×(−0.04) + 0.81×1 ≈ 0.73
(1,1), (1,1), (1,2), with U_h = −0.04 + 0.9×(−0.04) + 0.81×(−1) ≈ −0.89
How can we measure the expected value of U_h over this infinite set of sequences?

43 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)], measured over all possible sequences we can get if: we start from s_0; we continue till we reach a terminal state; we follow the optimal policy π*. There are infinitely many possible sequences. E[U_h(s_0, s_1, …, s_T)] is a weighted average, where the weight of each state sequence is the probability of that sequence, assuming that we are following the optimal policy π*. What is the optimal policy π*? It is the one that maximizes U(s) for all states s. It looks like a chicken-and-egg problem: we must know π* to compute U(s), and we must know the values U(s) to compute π*.

44 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)]. Suppose that, for state (1,1), the optimal action is "up". We will prove that "up" is indeed optimal a bit later. If the agent follows the optimal policy then, after one "up" action:
With probability 0.8 the agent gets to state (2,1): U_h((1,1), (2,1)) = −0.04 + 0.9×1 = 0.86.
With probability 0.1 the agent gets to state (1,2): U_h((1,1), (1,2)) = −0.04 + 0.9×(−1) = −0.94.
With probability 0.1, the agent stays at state (1,1).
So: U(1,1) = E[U_h((1,1), s_1, …, s_T)] = 0.8×0.86 + 0.1×(−0.94) + 0.1×X.
In the above, X is the expected utility if s_0 = s_1 = (1,1). Let's see how to compute X.

45 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)]. Suppose that s_0 = s_1 = (1,1). What is the expected utility in that case? E[U_h((1,1), (1,1), s_2, …, s_T)] can be decomposed as: the reward for state s_0, which is known: R(1,1) = −0.04; plus the expected value of the rewards for states s_1 = (1,1), s_2, …, s_T, which is E[γR(1,1) + γ²R(s_2) + γ³R(s_3) + … + γ^T R(s_T)]. So:
E[U_h((1,1), (1,1), s_2, …, s_T)]
= −0.04 + E[γR(1,1) + γ²R(s_2) + γ³R(s_3) + … + γ^T R(s_T)]
= −0.04 + γ E[R(1,1) + γR(s_2) + γ²R(s_3) + … + γ^{T−1} R(s_T)]
The expected value in the last line is the expected utility over all sequences starting at state (1,1), which is the definition of U(1,1).

46 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)]. Suppose that s_0 = s_1 = (1,1). What is the expected utility in that case? E[U_h((1,1), (1,1), s_2, …, s_T)] can be decomposed as: the reward for state s_0, which is known: R(1,1) = −0.04; plus the expected value of the rewards for states s_1 = (1,1), s_2, …, s_T. So:
E[U_h((1,1), (1,1), s_2, …, s_T)]
= −0.04 + E[γR(1,1) + γ²R(s_2) + γ³R(s_3) + … + γ^T R(s_T)]
= −0.04 + γ E[R(1,1) + γR(s_2) + γ²R(s_3) + … + γ^{T−1} R(s_T)]
= −0.04 + γ U(1,1)

47 Utility of a State
What is the utility of state (1,1)? U(s_0) = E[U_h(s_0, s_1, …, s_T)]. If we combine the results from the previous slides, we get:
U(1,1) = E[U_h((1,1), s_1, …, s_T)]
= 0.8×0.86 + 0.1×(−0.94) + 0.1×E[U_h((1,1), (1,1), s_2, …, s_T)]
= 0.8×0.86 + 0.1×(−0.94) + 0.1×(−0.04 + γ U(1,1))
This is an equation with one unknown, U(1,1). With γ = 0.9, we can solve it as:
U(1,1) = 0.688 − 0.094 + 0.1×(−0.04 + 0.9×U(1,1))
U(1,1) = 0.59 + 0.09×U(1,1)
0.91×U(1,1) = 0.59
U(1,1) ≈ 0.648
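As a sanity check (my own sketch, not from the slides), the same fixed point can also be found numerically by iterating the equation until it stops changing, which previews the value iteration algorithm introduced below:

# Iterate U(1,1) = 0.8*0.86 + 0.1*(-0.94) + 0.1*(-0.04 + 0.9*U(1,1)) to a fixed point.
u = 0.0
for _ in range(100):
    u = 0.8 * 0.86 + 0.1 * (-0.94) + 0.1 * (-0.04 + 0.9 * u)
print(round(u, 3))  # 0.648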

48 Utility of a State
What is the utility of state (1,1)? We have shown that, if the optimal action for state (1,1) is "up", then U(1,1) = 0.648. Using the exact same approach, we can compute U(1,1) under the other three assumptions: that the optimal action for state (1,1) is "down"; that it is "left"; that it is "right". If we compute the four values of U(1,1), obtained under each of the four assumptions, then what can we conclude?

49 Utility of a State
What is the utility of state (1,1)? We have shown that, if the optimal action for state (1,1) is "up", then U(1,1) = 0.648. Using the exact same approach, we can compute U(1,1) under the other three assumptions: that the optimal action is "down", "left", or "right". If we compute the four values of U(1,1), obtained under each of the four assumptions, then what can we conclude? The optimal action for (1,1) is the action that leads to the highest of the four values, and the true value of U(1,1) is the highest of those four values. If we do the calculations, "up" is the optimal action.

50 Utility of a State
What is the utility of state (1,1)? In other words, what is the expected total reward between now and the end of the mission, if the current position is (1,1)? What is π*(1,1)? In other words, what is the optimal action to take at state (1,1)? We computed that U(1,1) = 0.648 and π*(1,1) = "up". This problem was as simplified as possible, and it still took a significant amount of calculations to solve. We even skipped most of the calculations, for the hypotheses that the action is "down", "left", or "right". Our next goal is to identify algorithms for solving such problems.

51 Utility of a State
We want to identify general methods for computing: the utility of all states; and the optimal policy π*, which specifies for each state s the optimal action π*(s). To do that, we will revisit our solution for state (1,1), and we will reformulate that solution in a way that is easier to generalize.

52 Utility of a State
We computed U(1,1) by: computing, for each possible action a that we can take at state (1,1), the value of U(1,1) under the assumption that that action is optimal; and then choosing the maximum of those values as the true value of U(1,1). We can generalize this approach. First, some notation: define A(s) to be the set of all actions that the agent can take at state s, and define U(s, a) as the utility of state s under the assumption that π*(s) = a, i.e., the assumption that the best action at state s is a. Then:
U(s) = max_{a ∈ A(s)} U(s, a)
π*(s) = argmax_{a ∈ A(s)} U(s, a)

53 Utility of a State
U(s) = max_{a ∈ A(s)} U(s, a)
π*(s) = argmax_{a ∈ A(s)} U(s, a)
To compute U((1,1), "up"), i.e., the value of U(1,1) under the assumption that π*(1,1) = "up", we considered all possible outcomes of the "up" action: with probability 0.8 the agent gets to state (2,1); with probability 0.1 the agent gets to state (1,2); with probability 0.1 the agent stays at state (1,1). We computed the expected utility for each of those outcomes.

54 Utility of a State
U(s) = max_{a ∈ A(s)} U(s, a)
π*(s) = argmax_{a ∈ A(s)} U(s, a)
To compute U((1,1), "up"), i.e., the value of U(1,1) under the assumption that π*(1,1) = "up", we considered all possible outcomes of the "up" action, and we computed the expected utility for each of those outcomes. U((1,1), "up") was the weighted sum of the expected utilities of those outcomes, using as weights the probabilities of the outcomes. Thus:
U(s, a) = Σ_{s′} p(s′ | s, a) E[U_h(s, s′, …, s_T)]

55 Utility of a State
U(s) = max_{a ∈ A(s)} U(s, a)
π*(s) = argmax_{a ∈ A(s)} U(s, a)
U(s, a) = Σ_{s′} p(s′ | s, a) E[U_h(s, s′, …, s_T)]
Furthermore, we can decompose E[U_h(s, s′, …, s_T)] as: E[U_h(s, s′, …, s_T)] = R(s) + γ E[U_h(s′, …, s_T)] = R(s) + γ U(s′). Therefore:
U(s, a) = R(s) + γ Σ_{s′} p(s′ | s, a) U(s′)

56 The Bellman Equation
U(s) = max_{a ∈ A(s)} U(s, a)
π*(s) = argmax_{a ∈ A(s)} U(s, a)
U(s, a) = R(s) + γ Σ_{s′} p(s′ | s, a) U(s′)
Combining these equations together, we get:
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U(s′)
This equation is called the Bellman equation.

57 The Bellman Equation
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U(s′)
For each state s, we get a Bellman equation. If our environment has N states, we need to solve a system of N Bellman equations. In this system of equations, there is a total of N unknowns: the N values U(s). There is an iterative algorithm for solving this system of equations, called the value iteration algorithm.

58 The Value Iteration Algorithm
The value iteration algorithm computes the utility of each state for a Markov Decision Process. The algorithm takes the following inputs: the set of states 𝕊 = {s_1, …, s_N}; the set A(s) of actions available at each state s; the transition model p(s′ | s, a); the reward function R(s); the discount factor γ; and ε, which is the maximum error allowed in the utility of each state in the result of the algorithm.

59 The Value Iteration Algorithm
function ValueIteration(𝕊, A, p, R, γ, ε)
    N = size of 𝕊
    U′ = new array of doubles, of size N
    Initialize all values of U′ to 0
    repeat:
        U = copy of array U′
        δ = 0
        for each state s in 𝕊:
            U′[s] = R(s) + γ max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U[s′]
            if |U′[s] − U[s]| > δ then δ = |U′[s] − U[s]|
    until δ < ε(1 − γ)/γ
    return U
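Below is a compact Python sketch of value iteration for this grid world (my own code, not a line-by-line translation of the pseudocode). It reuses STATES, ACTIONS, and p from the transition-model sketch given earlier; the TERMINAL and R definitions are my own way of encoding the rewards from the slides.

TERMINAL = {(3, 4): 1.0, (2, 4): -1.0}

def R(s):
    # Reward function: +1 / -1 at the terminal states, -0.04 everywhere else.
    return TERMINAL.get(s, -0.04)

def value_iteration(gamma=0.9, eps=1e-4):
    # Repeated Bellman updates until the largest change drops below eps*(1-gamma)/gamma.
    U_new = {s: 0.0 for s in STATES}
    while True:
        U, delta = dict(U_new), 0.0
        for s in STATES:
            if s in TERMINAL:
                U_new[s] = R(s)  # terminal states: no actions, utility is just the reward
            else:
                U_new[s] = R(s) + gamma * max(
                    sum(prob * U[s2] for s2, prob in p(s, a).items()) for a in ACTIONS)
            delta = max(delta, abs(U_new[s] - U[s]))
        if delta < eps * (1 - gamma) / gamma:
            return U_new

U = value_iteration()
print(round(U[(3, 3)], 2), round(U[(1, 1)], 2))  # roughly 0.80 and 0.30 for these parameters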

60 The Value Iteration Algorithm
(Same pseudocode as on the previous slide.) We will skip the proof, but it can be proven that this algorithm converges to the correct solutions of the Bellman equations. Details can be found in S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach", third edition (2009), Prentice Hall.

61 The Value Iteration Algorithm
(Same pseudocode as above.) The main operation of this algorithm is the line
U′[s] = R(s) + γ max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U[s′]
where we use the Bellman equation to update the values U(s) using the previous estimates of those values. This update step is called a Bellman update.

62 The Value Iteration Algorithm
(Same pseudocode as above.) So, the value iteration algorithm can be summarized as follows: initialize the utilities of all states to zero, then repeatedly update the utilities of the states using Bellman updates, until the estimated values converge.

63 A Value Iteration Example
Let's see how the value iteration algorithm works on our example. Assume: R(s) = −0.04 if s is a non-terminal state, and γ = 0.9. We initialize all utility values to 0. (Figure: grid of utility values, all initialized to 0.)

64 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after one round of updates: the current estimate for each state s is R(s). (Figure: utility values after one round: −0.04 at every non-terminal state, +1 and −1 at the terminals.)

65 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after two rounds of updates: information about the +1 reward has reached state (3,3). (Figure: utility values after two rounds: U(3,3) ≈ 0.67, the other non-terminal states ≈ −0.08.)

66 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after three rounds of updates: information about the +1 reward has reached more states. (Figure: grid of utility values after three rounds.)

67 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after four rounds of updates: information about the +1 reward has reached more states. (Figure: grid of utility values after four rounds.)

68 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after five rounds of updates: information about the +1 reward has reached more states. (Figure: grid of utility values after five rounds.)

69 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after six rounds of updates: information about the +1 reward has now reached all states. (Figure: grid of utility values after six rounds.)

70 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after seven rounds of updates: values keep getting updated. (Figure: grid of utility values after seven rounds.)

71 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after eight rounds of updates: values continue changing. (Figure: grid of utility values after eight rounds.)

72 A Value Iteration Example
Assume R(s) = −0.04 for non-terminal states and γ = 0.9. This is the result after 13 rounds of updates: the values don't change much anymore after this round. (Figure: utility values after 13 rounds. Row 3: 0.51, 0.65, 0.80, +1. Row 2: 0.40, blocked, 0.49, −1. Row 1: 0.30, 0.25, 0.34, 0.13.)

73 Computing the Optimal Policy
The value iteration algorithm computes U(s) for every state s. Once we have computed all values U(s), we can get the optimal policy π* using this equation:
π*(s) = argmax_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U(s′)
Thus, π*(s) identifies the action that leads to the highest expected utility for the next state, as measured over all possible outcomes of that action. This approach is called one-step look-ahead.
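Continuing the earlier sketches (my own code, not the slides'), the one-step look-ahead can be written directly from this equation; it assumes the STATES, ACTIONS, TERMINAL, p, and value_iteration helpers defined in the sketches above.

def extract_policy(U):
    # One-step look-ahead: pick the action with the highest expected next-state utility.
    policy = {}
    for s in STATES:
        if s in TERMINAL:
            continue  # no action is needed at terminal states
        policy[s] = max(ACTIONS, key=lambda a: sum(prob * U[s2]
                                                   for s2, prob in p(s, a).items()))
    return policy

U = value_iteration(gamma=0.9)
print(extract_policy(U)[(1, 1)])  # "up" for these parameters, as computed earlier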

74 Computing the Optimal Policy
The figure shows the result of the value iteration algorithm for: R(s) = −0.02 if s is a non-terminal state, and γ = 1. How can we figure out the optimal policy based on that output? (Figure: utility values. Row 3: 0.90, 0.93, 0.95, +1. Row 2: 0.87, blocked, 0.77, −1. Row 1: 0.85, 0.82, 0.79, 0.57.)

75 Computing the Optimal Policy
Consider state (2,3). What is the optimal action for that state? We must consider each action. If the action is "left", these are the possible next states:
Probability 0.8: next state (2,3), utility 0.77
Probability 0.1: next state (3,3), utility 0.95
Probability 0.1: next state (1,3), utility 0.79
The weighted average is 0.79.

76 Computing the Optimal Policy
Consider state (2,3). What is the optimal action for that state? We must consider each action. If the action is "right", these are the possible next states:
Probability 0.8: next state (2,4), utility −1
Probability 0.1: next state (3,3), utility 0.95
Probability 0.1: next state (1,3), utility 0.79
The weighted average is −0.63.

77 Computing the Optimal Policy
Consider state (2,3). What is the optimal action for that state? We must consider each action. If the action is "up", these are the possible next states:
Probability 0.8: next state (3,3), utility 0.95
Probability 0.1: next state (2,4), utility −1.00
Probability 0.1: next state (2,3), utility 0.77
The weighted average is 0.74.

78 Computing the Optimal Policy
Consider state (2,3). What is the optimal action for that state? We must consider each action. If the action is "down", these are the possible next states:
Probability 0.8: next state (1,3), utility 0.79
Probability 0.1: next state (2,4), utility −1.00
Probability 0.1: next state (2,3), utility 0.77
The weighted average is 0.61.
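The four weighted averages above are easy to verify with a few lines of Python (my own check, not from the slides); the utility values are the ones shown in the figure for R(s) = −0.02 and γ = 1.

# Expected next-state utility for each action at state (2,3).
U = {(2, 3): 0.77, (3, 3): 0.95, (1, 3): 0.79, (2, 4): -1.0}
outcomes = {
    "left":  [(0.8, (2, 3)), (0.1, (3, 3)), (0.1, (1, 3))],
    "right": [(0.8, (2, 4)), (0.1, (3, 3)), (0.1, (1, 3))],
    "up":    [(0.8, (3, 3)), (0.1, (2, 4)), (0.1, (2, 3))],
    "down":  [(0.8, (1, 3)), (0.1, (2, 4)), (0.1, (2, 3))],
}
for a, outs in outcomes.items():
    print(a, round(sum(prob * U[s] for prob, s in outs), 2))
# left 0.79, right -0.63, up 0.74, down 0.61: "left" is the best action at (2,3)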

79 Computing the Optimal Policy
For state (2,3), action "left" led to the highest expected utility for the next state. Thus, action "left" is the best action for state (2,3). Note that choosing the best action is not always a matter of trying to move towards the best state. At state (2,3) the best action is towards the blocked square, to play it safe. Going up is risky: it has a 10% chance of leading to the −1 state.

80 Computing the Optimal Policy
The figure shows the optimal policy for: R(s) = −0.02 if s is a non-terminal state, and γ = 1. Note that choosing the best policy is more complicated than simply pointing in the direction of highest reward. At state (2,3) the best action is towards the blocked square, to play it safe. Going up is risky: it has a 10% chance of leading to the −1 state.

81 The Policy Iteration Algorithm
There is an alternative algorithm for computing optimal policies that is more efficient. Remember that, if we know the utility of each state, we can compute the optimal policy π* using:
π*(s) = argmax_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U(s′)
However, to get the right π*(s), we don't need to know the utilities very accurately. We just need to know the utilities accurately enough that, for each state s, the argmax chooses the right action.

82 The Policy Iteration Algorithm
This alternative algorithm for computing optimal policies is called the policy iteration algorithm. It is an iterative algorithm. Initialization: initialize some policy π_0 with random choices for the best action at each state. Main loop: Policy evaluation: given the current policy π_i, calculate utility values U^{π_i}(s), corresponding to the utility of each state s if the agent follows policy π_i. Policy improvement: given the current utility values U^{π_i}(s), use one-step look-ahead to compute the new policy π_{i+1}.

83 The Policy Iteration Algorithm
To be able to implement the policy iteration algorithm, we need to specify how to carry out each of the two steps of the main loop: Policy evaluation. Policy improvement.

84 The Policy Evaluation Step
Task: calculate utility values U^{π_i}(s), corresponding to the assumption that the agent follows policy π_i. When the policy was not known, we used the Bellman equation:
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U(s′)
Now that the policy π_i is specified, we can instead use a simplified version of the Bellman equation:
U^{π_i}(s) = R(s) + γ Σ_{s′} p(s′ | s, π_i(s)) U^{π_i}(s′)
Key difference: now π_i(s) specifies the action for each state s, so we do not need to take the max over all possible actions.

85 The Policy Evaluation Step
U^{π_i}(s) = R(s) + γ Σ_{s′} p(s′ | s, π_i(s)) U^{π_i}(s′)
This is a linear equation. The original Bellman equation, which takes the max over all possible actions, is not a linear equation. If we have N states, we get N linear equations of this form, with N unknowns. We can solve those N linear equations in O(N³) time, using standard linear algebra methods.
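For a small MDP, this linear solve is a one-liner with NumPy (my own sketch; it assumes the states have been indexed 0..N−1 and that the matrix P_pi with entries P_pi[i][j] = p(s_j | s_i, π_i(s_i)) has already been built from the transition model).

import numpy as np

def evaluate_policy_exact(P_pi, R_vec, gamma):
    # Solve the linear system U = R + gamma * P_pi U, i.e. (I - gamma * P_pi) U = R.
    n = len(R_vec)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_vec)

# Tiny illustration with two states: state 0 always moves to state 1 under the current
# policy, and state 1 is treated as terminal (its row is all zeros: no future rewards).
P_pi = np.array([[0.0, 1.0],
                 [0.0, 0.0]])
R_vec = np.array([-0.04, 1.0])
print(evaluate_policy_exact(P_pi, R_vec, gamma=0.9))  # U = [0.86, 1.0]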

86 The Policy Evaluation Step
For large state spaces, O(N³) is prohibitive. Alternative: do some rounds of iterations. Obviously, doing K iterations does not guarantee that the utilities are computed correctly. Parameter K allows us to trade speed for accuracy: larger values lead to slower runtimes and higher accuracy.
function PolicyEvaluation(𝕊, p, R, γ, π_i, K, U)
    U_0 = copy of U
    for k = 1 to K:
        for each state s in 𝕊:
            U_k[s] = R(s) + γ Σ_{s′} p(s′ | s, π_i(s)) U_{k−1}[s′]
    return U_K

87 The Policy Evaluation Step
(Same pseudocode as on the previous slide.) Note that the PolicyEvaluation function takes as an argument a current estimate U. We will see later how the PolicyEvaluation function is called from the PolicyIteration function.

88 The Policy Iteration Algorithm
function PolicyIteration(𝕊, A, p, R, γ, K)
    N = size of 𝕊
    U = new array of size N, all values initialized to 0
    π = new array of actions, of size N
    Initialize all values of π to random (but legal) actions
    repeat:
        U = PolicyEvaluation(𝕊, p, R, γ, π, K, U)
        unchanged = true
        for each state s in 𝕊:
            if max_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U[s′] > Σ_{s′} p(s′ | s, π[s]) U[s′]:
                π[s] = argmax_{a ∈ A(s)} Σ_{s′} p(s′ | s, a) U[s′]
                unchanged = false
    until unchanged == true
    return π
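Here is a Python sketch of policy iteration for the same grid world (my own code, not a literal translation of the pseudocode); it reuses STATES, ACTIONS, TERMINAL, R, and p from the earlier sketches, and uses K rounds of iterative policy evaluation as described above.

import random

def expected_utility(s, a, U):
    return sum(prob * U[s2] for s2, prob in p(s, a).items())

def policy_iteration(gamma=0.9, K=20):
    U = {s: 0.0 for s in STATES}
    pi = {s: random.choice(list(ACTIONS)) for s in STATES if s not in TERMINAL}
    while True:
        # Policy evaluation: K rounds of simplified Bellman updates under the policy pi.
        for _ in range(K):
            U = {s: R(s) if s in TERMINAL
                 else R(s) + gamma * expected_utility(s, pi[s], U) for s in STATES}
        # Policy improvement: one-step look-ahead with the current utility estimates.
        unchanged = True
        for s in pi:
            best = max(ACTIONS, key=lambda a: expected_utility(s, a, U))
            if expected_utility(s, best, U) > expected_utility(s, pi[s], U):
                pi[s], unchanged = best, False
        if unchanged:
            return pi

print(policy_iteration()[(1, 1)])  # should print "up" for these parameters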

89 The Policy Iteration Algorithm
(Same pseudocode as on the previous slide.) The main loop alternates between: updating the utilities given the policy, and updating the policy given the utilities. The main loop exits when the policy stops changing.


92 Markov Decision Processes: Recap
In Markov Decision Processes: Each state has a reward R(s). Each state sequence s_0, s_1, …, s_T has a utility U_h, which is computed by adding the discounted rewards of all states in the sequence. An action can lead to multiple outcomes; the probability of each outcome given the state and the action is known. A policy is a function mapping states to actions. The utility of a state s_0 is the expected utility, measured over all state sequences that can lead from s_0 to a terminal state, under the assumption that the agent follows the optimal policy.

93 Markov Decision Processes: Recap
The value iteration algorithm computes the utility of each state using an iterative approach. Once the utilities of all states have been computed, the optimal policy is defined by identifying, for each state, the action leading to the highest expected utility. The policy iteration algorithm is a more efficient alternative, at the cost of possibly losing some accuracy. It computes the optimal policy directly, without computing exact values for the utilities. Utility values are updated for a few rounds only, and not until convergence.

