# Markov Decision Processes (MDPs) read Ch 17.1-17.2 utility-based agents –goals encoded in utility function U(s), or U:S  effects of actions encoded in.

## Presentation on theme: "Markov Decision Processes (MDPs) read Ch 17.1-17.2 utility-based agents –goals encoded in utility function U(s), or U:S  effects of actions encoded in."— Presentation transcript:

Markov Decision Processes (MDPs) read Ch 17.1-17.2 utility-based agents –goals encoded in utility function U(s), or U:S  effects of actions encoded in state transition function: T:SxA  S –or T:SxA  pdf(S) for non-deterministic rewards/costs encoded in reward function: R:SxA  Markov property: effects of actions only depend on current state, not previous history

the goal: maximize reward over time –long-term discounted reward –handles infinite horizon; encourages quicker achievement “plans” are encoded in policies –mappings from states to actions:  :S  A how to compute optimal policy  * that maximizes long- term discounted reward?

value function V  (s): expected long-term reward from starting in state s and following policy  derive policy from V(s):  (s)=max a  A E[R(s,a)+  V(T(s,  (s)))] = max  p(s’|s,a)·(R+  V(s’)) optimal policy comes from optimal value function:  (s)= max  p(s’|s,a)·V*(s’) =

Bellman’s equations –(eqn 17.5) method 1: linear programming –n coupled linear equations –v1 = max(v2,v3,v4...) –v2 = max(v1,v3,v4...) –v3 = max(v1,v2,v4...) –solve for {v1,v2,v3...} using Gnu LP kit, etc. Calculating V*(s)

method 2: Value Iteration –initialize V(s)=0 for all states –iteratively update value of each state based on neighbors –...until convergence

Similar presentations