Chapter 17 – Making Complex Decisions


1 Chapter 17 – Making Complex Decisions
March 30, 2004

2 17.1 Sequential Decision Problems
Sequential decision problems vs. episodic ones
Fully observable environment
Stochastic actions
Figure 17.1

3 Markov Decision Process (MDP)
S0: the initial state.
T(s, a, s'): the transition model. Yields a 3-D table filled with probabilities. Markovian: the distribution over the next state depends only on the current state and action.
R(s): the reward associated with a state; the short-term reward for being in s.
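To make the later slides concrete, here is a minimal sketch (an assumption, not the book's code) of how these three components might be packed into a Python container; the dict-of-lists transition format and the default discount factor are illustrative choices.

```python
# Minimal MDP container for the components named above (illustrative sketch).
# T maps (s, a) to a list of (probability, s') pairs; R maps each state to its
# short-term reward.
class MDP:
    def __init__(self, s0, T, R, gamma=0.9):
        self.s0 = s0          # initial state S0
        self.T = T            # transition model T(s, a, s')
        self.R = R            # reward R(s)
        self.gamma = gamma    # discount factor (assumed default 0.9)

    def states(self):
        return list(self.R)

    def actions(self, s):
        # Actions that have at least one transition defined from state s.
        return [a for (st, a) in self.T if st == s]

    def transitions(self, s, a):
        # Markovian: the distribution over s' depends only on (s, a).
        return self.T.get((s, a), [])
```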

4 Policy, π
π(s): the action recommended in state s. A policy is the solution to the MDP.
π*(s): the optimal policy, which yields the MEU (maximum expected utility) over environment histories.
Figure 17.2

5 Utility Function (Long-term reward for a state sequence)
Additive: U_h([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …
Discounted: U_h([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + …, where 0 ≤ γ ≤ 1 (γ = gamma, the discount factor)

6 Discounted Rewards Are Better
Infinite state sequences are a problem, because additive utilities can grow to +infinity or –infinity. Three ways to handle this:
Use discounting with rewards bounded by Rmax and γ < 1, so every utility is bounded by Rmax / (1 – γ) (derivation below)
Guarantee that the agent will reach a terminal (goal) state
Use the average reward per step as the basis for comparison
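The reason bounded rewards plus γ < 1 work: the discounted sum is dominated by a geometric series, so every infinite sequence has a finite utility.

```latex
U_h([s_0, s_1, s_2, \ldots]) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\qquad (0 \le \gamma < 1)
```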

7 17.2 Value Iteration
MEU principle: π*(s) = argmax_a ∑_s' T(s, a, s') U(s')
Bellman equation: U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
Bellman update: U_i+1(s) = R(s) + γ max_a ∑_s' T(s, a, s') U_i(s')
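A sketch of value iteration built from the Bellman update above, reusing the MDP container from slide 3; the zero-initialized utilities and the stopping threshold epsilon are implementation assumptions.

```python
def value_iteration(mdp, epsilon=1e-6):
    """Repeat the Bellman update until the utilities stop changing (delta < epsilon)."""
    U = {s: 0.0 for s in mdp.states()}                    # U_0(s) = 0
    while True:
        U_next, delta = {}, 0.0
        for s in mdp.states():
            q = [sum(p * U[s1] for p, s1 in mdp.transitions(s, a))
                 for a in mdp.actions(s)]
            best = max(q) if q else 0.0                   # states with no actions act as terminals
            U_next[s] = mdp.R[s] + mdp.gamma * best       # Bellman update
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon:
            return U

def best_policy(mdp, U):
    """MEU principle: in each state pick the action maximizing expected utility of s'."""
    return {s: max(mdp.actions(s),
                   key=lambda a: sum(p * U[s1] for p, s1 in mdp.transitions(s, a)))
            for s in mdp.states() if mdp.actions(s)}
```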

8 Example
S0 = A
T(A, move, B) = 1. T(B, move, C) = 1.
R(A) = –0.2, R(B) = –0.2, R(C) = 1
Figure 17.4
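Feeding this example into the sketches above (with an assumed γ = 1, since the slide does not give a discount factor and the chain ends at C):

```python
# The A -> B -> C chain from this slide, using the MDP container from slide 3.
# C has no outgoing transitions, so it behaves as a terminal state.
example = MDP(s0='A',
              T={('A', 'move'): [(1.0, 'B')],
                 ('B', 'move'): [(1.0, 'C')]},
              R={'A': -0.2, 'B': -0.2, 'C': 1.0},
              gamma=1.0)                     # gamma = 1 is an assumption

U = value_iteration(example)
# U(C) = 1.0, U(B) = -0.2 + 1.0 = 0.8, U(A) = -0.2 + 0.8 = 0.6
print(U)
print(best_policy(example, U))               # {'A': 'move', 'B': 'move'}
```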

9 Value Iteration
For γ < 1, convergence is guaranteed and the solution is unique!
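The guarantee rests on the Bellman update B being a contraction by a factor γ in the max norm, so repeated updates pull any two utility estimates together toward a single fixed point:

```latex
\lVert B\,U - B\,U' \rVert_{\infty} \;\le\; \gamma \,\lVert U - U' \rVert_{\infty}
```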

10 17.3 Policy Iteration
Figure 17.7
Policy evaluation: given a policy πi, calculate the utility of each state.
Policy improvement: use a one-step look-ahead to calculate a better policy, πi+1.
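A sketch of this alternating loop, again using the MDP container from slide 3; the evaluation step is passed in as a function so that either of the evaluation methods on the next two slides can be plugged in. The random initial policy is an assumption.

```python
import random

def policy_iteration(mdp, policy_evaluation):
    """Alternate evaluation and improvement until the policy no longer changes."""
    # Arbitrary initial policy over the non-terminal states.
    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states() if mdp.actions(s)}
    while True:
        U = policy_evaluation(pi, mdp)                    # policy evaluation of pi_i
        unchanged = True
        for s in pi:                                      # policy improvement: one-step look-ahead
            def expected(a):
                return sum(p * U[s1] for p, s1 in mdp.transitions(s, a))
            best_a = max(mdp.actions(s), key=expected)
            if expected(best_a) > expected(pi[s]):
                pi[s] = best_a
                unchanged = False
        if unchanged:
            return pi, U
```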

11 Bellman Equation (Standard Policy Iteration)
Old version: U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
New version: U_i(s) = R(s) + γ ∑_s' T(s, πi(s), s') U_i(s')
With the policy fixed, these are linear equations: n states give n equations in n unknowns, solvable exactly in O(n³) time.
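A sketch of exact policy evaluation using numpy to solve the linear system (I − γ T_π) U = R; the O(n³) cost mentioned above is the cost of this solve. The matrix construction is an illustrative assumption.

```python
import numpy as np

def policy_evaluation_exact(pi, mdp):
    """Solve the fixed-policy equations U = R + gamma * T_pi U, i.e. (I - gamma * T_pi) U = R."""
    states = mdp.states()
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))
    for s in states:
        if s in pi:                                       # states without an action stay absorbing
            for p, s1 in mdp.transitions(s, pi[s]):
                T_pi[idx[s], idx[s1]] = p
    R = np.array([mdp.R[s] for s in states], dtype=float)
    U = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R)  # the O(n^3) step
    return {s: U[idx[s]] for s in states}
```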

12 Bellman Update (Modified Policy Iteration)
Old method: U_i+1(s) = R(s) + γ max_a ∑_s' T(s, a, s') U_i(s')
New method: U_i+1(s) = R(s) + γ ∑_s' T(s, πi(s), s') U_i(s')
Run k of these simplified updates to get an estimate of the utilities. This is called modified policy iteration.
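A sketch of the k-sweep approximation; the default k = 20 is an arbitrary assumption.

```python
def policy_evaluation_modified(pi, mdp, k=20):
    """Approximate the utilities of a fixed policy with k simplified Bellman updates
    (no max over actions, since pi fixes the action in each state)."""
    U = {s: 0.0 for s in mdp.states()}
    for _ in range(k):
        U_next = {}
        for s in mdp.states():
            expected = (sum(p * U[s1] for p, s1 in mdp.transitions(s, pi[s]))
                        if s in pi else 0.0)
            U_next[s] = mdp.R[s] + mdp.gamma * expected
        U = U_next
    return U
```

Either evaluator can be handed to the policy_iteration sketch above, e.g. policy_iteration(example, policy_evaluation_modified).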

13 17.4 Partially Observable MDPs (POMDPs)
Observation model O(s, o): gives the probability of perceiving observation o in state s.
Belief state b: a probability distribution over states, e.g. <0.5, 0.5, 0> in a three-state environment. Mechanisms are needed to update b after each action and observation.
The optimal action depends only on the agent's current belief state!
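A sketch of one way to maintain b, assuming a dict-based belief and an observation model O stored as a dict keyed by (state, observation): the update weights the probability of reaching each state by the probability of the observation there, then normalizes.

```python
def update_belief(b, a, o, mdp, O):
    """b: dict state -> probability; O: dict (state, observation) -> probability.
    Returns the new belief after taking action a and perceiving o (normalized)."""
    b_next = {}
    for s1 in mdp.states():
        # Probability of reaching s1 from the current belief under action a ...
        reach = sum(p * b.get(s, 0.0)
                    for s in mdp.states()
                    for p, s2 in mdp.transitions(s, a) if s2 == s1)
        # ... weighted by the probability of perceiving o in s1.
        b_next[s1] = O.get((s1, o), 0.0) * reach
    total = sum(b_next.values())
    return {s: v / total for s, v in b_next.items()} if total > 0 else b_next
```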

14 Algorithm
Given the current belief state b, execute the action a = π*(b).
Receive observation o.
Update b. Note that b lives in a continuous, multi-dimensional belief space.
Repeat.
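A sketch of this loop; pi_star and env are hypothetical stand-ins for an optimal belief-space policy and an environment interface (neither is given on the slide), and update_belief is the sketch from slide 13.

```python
def pomdp_agent_loop(b, pi_star, env, mdp, O, steps=100):
    """pi_star: hypothetical mapping from belief state to action;
    env: hypothetical environment whose execute(a) returns an observation o."""
    for _ in range(steps):
        a = pi_star(b)                         # act on the current belief only
        o = env.execute(a)                     # receive an observation
        b = update_belief(b, a, o, mdp, O)     # update the belief state
    return b
```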

