1
Chapter 17 – Making Complex Decisions
March 30, 2004
2
17.1 Sequential Decision Problems
- Sequential decision problems vs. episodic ones
- Fully observable environment
- Stochastic actions
- Figure 17.1
3
Markov Decision Process (MDP)
- S0: the initial state.
- T(s, a, s'): the transition model. Yields a 3-D table filled with probabilities. Markovian.
- R(s): the reward associated with a state (the short-term reward for being in that state).
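A minimal sketch of these ingredients as data, borrowing the values from the example on slide 8 (the dictionary layout and the name mdp are mine): the nested T[s][a][s'] structure is exactly the 3-D table of probabilities.

```python
mdp = {
    "S0": "A",                                # initial state
    "T": {                                    # T(s, a, s') = P(s' | s, a): a 3-D table of probabilities
        "A": {"move": {"B": 1.0}},
        "B": {"move": {"C": 1.0}},
    },
    "R": {"A": -0.2, "B": -0.2, "C": 1.0},    # R(s): short-term reward for each state
}
```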
4
Policy, π
- π(s): the action recommended in state s. This is the solution to the MDP.
- π*(s): the optimal policy. Yields the MEU (maximum expected utility) over environment histories.
- Figure 17.2
5
Utility Function (Long term reward for state)
- Additive: U_h([s0, s1, …]) = R(s0) + R(s1) + …
- Discounted: U_h([s0, s1, …]) = R(s0) + γR(s1) + γ²R(s2) + …, where 0 ≤ γ ≤ 1 (γ, gamma, is the discount factor)
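A small sketch of the two definitions (the helper names are mine, not from the slides):

```python
# Additive: U_h([s0, s1, ...]) = R(s0) + R(s1) + ...
def additive_utility(rewards):
    return sum(rewards)

# Discounted: U_h([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..., with 0 <= gamma <= 1
def discounted_utility(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))
```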
6
Discounted Rewards Better
However, infinite state sequences can lead to utilities of +∞ or −∞. Remedies:
- Bound rewards by Rmax and use γ < 1.
- Guarantee that the agent will reach a goal (terminal) state.
- Use the average reward per step as the basis for comparison.
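One way to see why bounding rewards by Rmax and taking γ < 1 keeps utilities finite (a standard geometric-series argument, not spelled out on the slide): if |R(s)| ≤ Rmax for every s, then

```latex
U_h([s_0, s_1, s_2, \ldots]) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}.
```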
7
17.2 Value Iteration
- MEU principle: π*(s) = argmax_a ∑_{s'} T(s, a, s') * U(s')
- Bellman equation: U(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') * U(s')
- Bellman update: U_{i+1}(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') * U_i(s')
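A concrete sketch of these updates (the function names and the (probability, next_state) representation are mine, not from the slides); states with no available actions are treated as terminal, and the loop stops once the largest change in utility falls below a small threshold:

```python
# A(s): actions available in s; T(s, a): list of (probability, next_state) pairs; R(s): reward of s.
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    U = {s: 0.0 for s in S}
    while True:
        U_new, delta = {}, 0.0
        for s in S:
            if A(s):
                # Bellman update: U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * U_i(s')
                best = max(sum(p * U[s2] for p, s2 in T(s, a)) for a in A(s))
                U_new[s] = R(s) + gamma * best
            else:
                U_new[s] = R(s)              # terminal state: no look-ahead
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                      # simple stopping rule
            return U

# MEU principle: pi*(s) = argmax_a sum_{s'} T(s, a, s') * U(s')
def best_policy(S, A, T, U):
    return {s: max(A(s), key=lambda a: sum(p * U[s2] for p, s2 in T(s, a)))
            for s in S if A(s)}
```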
8
Example
- S0 = A
- T(A, move, B) = 1, T(B, move, C) = 1
- R(A) = -0.2, R(B) = -0.2, R(C) = 1
- Figure 17.4
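Running the value-iteration sketch from the previous slide on this chain, assuming C is terminal (no actions available there) and γ = 1; the slide does not state these details, so they are assumptions of mine:

```python
S = ["A", "B", "C"]
A = lambda s: ["move"] if s in ("A", "B") else []            # C treated as terminal
T = lambda s, a: [(1.0, "B")] if s == "A" else [(1.0, "C")]
R = {"A": -0.2, "B": -0.2, "C": 1.0}

U = value_iteration(S, A, T, R.get, gamma=1.0)
# Expected: U(C) = 1.0, U(B) = -0.2 + 1.0 = 0.8, U(A) = -0.2 + 0.8 = 0.6
```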
9
Value Iteration
- Convergence guaranteed!
- Unique solution guaranteed!
10
17.3 Policy Iteration (Figure 17.7)
- Policy evaluation: given a policy π_i, calculate the utility of each state.
- Policy improvement: use a one-step look-ahead to calculate a better policy, π_{i+1}.
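A sketch of the alternating loop (the names are mine; `evaluate` can be either of the policy-evaluation sketches shown with the next two slides):

```python
def policy_iteration(S, A, T, R, evaluate, gamma=0.9):
    pi = {s: A(s)[0] for s in S if A(s)}             # start from an arbitrary policy
    while True:
        U = evaluate(pi, S, A, T, R, gamma)          # policy evaluation
        unchanged = True
        for s in S:
            if not A(s):
                continue
            def q(a):                                # expected utility of doing a, then following U
                return sum(p * U[s2] for p, s2 in T(s, a))
            best = max(A(s), key=q)                  # policy improvement: one-step look-ahead
            if q(best) > q(pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U
```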
11
Bellman Equation (Standard Policy Iteration)
- Old version: U(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') * U(s')
- New version: U_i(s) = R(s) + γ ∑_{s'} T(s, π_i(s), s') * U_i(s')
- The new version is linear: with n states there are n equations in n unknowns, which can be solved exactly in O(n³) time.
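A sketch of solving that linear system directly with NumPy (the matrix setup and names are mine, not the book's code):

```python
import numpy as np

# Solve U = R + gamma * P_pi U, i.e. (I - gamma * P_pi) U = R:
# n linear equations in n unknowns, solved exactly in O(n^3) time.
def policy_evaluation_exact(pi, S, A, T, R, gamma):
    idx = {s: i for i, s in enumerate(S)}
    P = np.zeros((len(S), len(S)))
    r = np.array([R(s) for s in S], dtype=float)
    for s in S:
        if A(s):                                   # terminal states keep an all-zero row
            for p, s2 in T(s, pi[s]):
                P[idx[s], idx[s2]] = p
    U = np.linalg.solve(np.eye(len(S)) - gamma * P, r)
    return {s: U[idx[s]] for s in S}
```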
12
Bellman Update (Modified Policy Iteration)
- Old method: U_{i+1}(s) = R(s) + γ max_a ∑_{s'} T(s, a, s') * U_i(s')
- New method: U_{i+1}(s) = R(s) + γ ∑_{s'} T(s, π_i(s), s') * U_i(s')
- Run k updates to get an estimate of the utilities. This is called modified policy iteration.
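A sketch of the k-step approximate evaluation (names are mine; it plugs into the same loop as the exact version):

```python
# Run k simplified Bellman updates:
# U_{i+1}(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U_i(s')
def policy_evaluation_approx(pi, S, A, T, R, gamma, k=20, U=None):
    if U is None:
        U = {s: 0.0 for s in S}
    for _ in range(k):
        U = {s: (R(s) + gamma * sum(p * U[s2] for p, s2 in T(s, pi[s]))) if A(s) else R(s)
             for s in S}
    return U
```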
13
17.4 Partially Observable MDPs (POMDPs)
- Observation model, O(s, o): gives the probability of perceiving observation o in state s.
- Belief state, b: for example, <.5, .5, 0> in a 3-state environment. Mechanisms are needed to update b.
- The optimal action depends only on the agent's current belief state!
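A sketch of one such update mechanism, the standard filtering update b'(s') = α O(s', o) ∑_s T(s, a, s') b(s) (the function name and dictionary representation are mine):

```python
def update_belief(b, a, o, S, T, O):
    """b: dict state -> probability; T(s, a, s'): transition probability; O(s, o): observation probability."""
    b_new = {s2: O(s2, o) * sum(T(s, a, s2) * b[s] for s in S) for s2 in S}
    alpha = sum(b_new.values())                 # normalizing constant
    return {s: p / alpha for s, p in b_new.items()}
```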
14
Algorithm
1. Given b, execute the action a = π*(b).
2. Receive observation o.
3. Update b. (Belief states form a continuous, multi-dimensional space.)
4. Repeat.
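A sketch of this loop, with hypothetical `env.step` and `pomdp_policy` helpers standing in for the parts the slide leaves abstract; `update_belief` is the sketch from the previous slide:

```python
def run_pomdp_agent(b, env, pomdp_policy, S, T, O, n_steps=100):
    for _ in range(n_steps):
        a = pomdp_policy(b)                      # given b, execute a = pi*(b)
        o = env.step(a)                          # receive observation o
        b = update_belief(b, a, o, S, T, O)      # update the belief state b
    return b
```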