Markov Decision Processes (MDPs) (read Ch. 17.1-17.2)
utility-based agents
–goals encoded in a utility function U(s), or U: S → ℝ
effects of actions encoded in a state transition function
–T: S × A → S, or T: S × A → pdf(S) for non-deterministic actions
rewards/costs encoded in a reward function
–R: S × A → ℝ
Markov property: the effects of an action depend only on the current state, not on the previous history
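Below is a minimal sketch of one way these pieces (states S, actions A, transition function T, reward function R) might be represented in Python; the dictionary layout and the toy numbers are illustrative assumptions, not part of the slides.

```python
# Hypothetical toy MDP: two states, two actions, non-deterministic transitions.
mdp = {
    "states":  ["s0", "s1"],
    "actions": ["stay", "go"],
    # transition function T: S x A -> pdf(S), stored as p(s' | s, a)
    "T": {("s0", "stay"): {"s0": 0.9, "s1": 0.1},
          ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
          ("s1", "stay"): {"s1": 1.0},
          ("s1", "go"):   {"s0": 0.5, "s1": 0.5}},
    # reward function R: S x A -> real-valued reward
    "R": {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
          ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
}
```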
the goal: maximize reward over time
–long-term discounted reward: E[ Σ_t γ^t R(s_t, a_t) ], with discount factor 0 < γ < 1 (illustrated in the sketch below)
–handles an infinite horizon; encourages quicker achievement of rewards
"plans" are encoded in policies
–mappings from states to actions: π: S → A
how do we compute the optimal policy π* that maximizes long-term discounted reward?
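To make the discounting concrete, here is a tiny Python sketch (not from the slides; the reward sequences and γ = 0.9 are made up) showing that the same total reward is worth more when it arrives earlier:

```python
# Long-term discounted reward of a reward sequence r_0, r_1, ...: sum_t gamma^t * r_t
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([10, 0, 0]))  # immediate reward           -> 10.0
print(discounted_return([0, 0, 10]))  # same reward, 2 steps later -> 8.1
```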
value function V^π(s): expected long-term discounted reward from starting in state s and following policy π
derive a policy from V(s):
–π(s) = argmax_a E[ R(s,a) + γ V(T(s,a)) ] = argmax_a Σ_s' p(s'|s,a)·(R(s,a) + γ V(s'))
the optimal policy comes from the optimal value function:
–π*(s) = argmax_a Σ_s' p(s'|s,a)·(R(s,a) + γ V*(s'))
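A minimal sketch of this greedy policy extraction, assuming the model is stored as NumPy arrays P[a][s, s'] = p(s'|s,a) and R[s, a] (hypothetical names, not from the slides):

```python
import numpy as np

def greedy_policy(V, P, R, gamma=0.9):
    """pi(s) = argmax_a sum_s' p(s'|s,a) * (R(s,a) + gamma * V(s'))."""
    n_actions = len(P)
    # Q[s, a] = one-step lookahead value of taking action a in state s
    Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
    return Q.argmax(axis=1)  # index of the best action in each state
```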
Calculating V*(s)
Bellman's equations (eqn. 17.5):
–V*(s) = max_a Σ_s' p(s'|s,a)·(R(s,a) + γ V*(s'))
method 1: linear programming
–n coupled equations, one per state, solvable as a linear program
–v1 = max(v2, v3, v4, ...)
–v2 = max(v1, v3, v4, ...)
–v3 = max(v1, v2, v4, ...)
–solve for {v1, v2, v3, ...} using the GNU Linear Programming Kit, etc. (see the sketch below)
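The slide points to the GNU LP kit; as an illustration, here is a hedged sketch of the same linear program using SciPy's linprog instead (the array names P[a][s, s'], R[s, a] and the toy numbers are assumptions). It minimizes Σ_s V(s) subject to V(s) ≥ R(s,a) + γ Σ_s' p(s'|s,a)·V(s') for every state-action pair, whose solution is V*:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.9):
    """Solve for V* with the standard MDP linear program."""
    n_actions, n_states = len(P), R.shape[0]
    c = np.ones(n_states)            # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            # constraint: gamma * sum_s' p(s'|s,a) V(s') - V(s) <= -R(s,a)
            row = gamma * P[a][s] - np.eye(n_states)[s]
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states, method="highs")
    return res.x

# hypothetical 2-state, 2-action MDP
P = [np.array([[0.8, 0.2], [0.1, 0.9]]),   # p(s'|s, a=0)
     np.array([[0.5, 0.5], [0.3, 0.7]])]   # p(s'|s, a=1)
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
print(solve_mdp_lp(P, R))                  # approximates V*
```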
method 2: Value Iteration
–initialize V(s) = 0 for all states
–iteratively update the value of each state from its successor states using the Bellman update: V(s) ← max_a Σ_s' p(s'|s,a)·(R(s,a) + γ V(s'))
–...until convergence (a sketch follows below)
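A minimal value-iteration sketch under the same hypothetical P[a], R[s, a] arrays used above; the synchronous update scheme and the convergence tolerance are assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_actions, n_states = len(P), R.shape[0]
    V = np.zeros(n_states)                    # initialize V(s) = 0 for all states
    while True:
        # Q[s, a] = R(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)                 # Bellman update for every state
        if np.max(np.abs(V_new - V)) < tol:   # stop when values stop changing
            return V_new, Q.argmax(axis=1)    # V* estimate and greedy policy
        V = V_new
```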