Download presentation

Presentation is loading. Please wait.

Published byJosue Skinner Modified about 1 year ago

1
Markov Decision Processes (MDPs) read Ch utility-based agents –goals encoded in utility function U(s), or U:S effects of actions encoded in state transition function: T:SxA S –or T:SxA pdf(S) for non-deterministic rewards/costs encoded in reward function: R:SxA Markov property: effects of actions only depend on current state, not previous history

2
the goal: maximize reward over time –long-term discounted reward –handles infinite horizon; encourages quicker achievement “plans” are encoded in policies –mappings from states to actions: :S A how to compute optimal policy * that maximizes long- term discounted reward?

3

4
value function V (s): expected long-term reward from starting in state s and following policy derive policy from V(s): (s)=max a A E[R(s,a)+ V(T(s, (s)))] = max p(s’|s,a)·(R+ V(s’)) optimal policy comes from optimal value function: (s)= max p(s’|s,a)·V*(s’) =

5
Bellman’s equations –(eqn 17.5) method 1: linear programming –n coupled linear equations –v1 = max(v2,v3,v4...) –v2 = max(v1,v3,v4...) –v3 = max(v1,v2,v4...) –solve for {v1,v2,v3...} using Gnu LP kit, etc. Calculating V*(s)

6
method 2: Value Iteration –initialize V(s)=0 for all states –iteratively update value of each state based on neighbors –...until convergence

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google