
1 Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)
Speaker: 虞台文, Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University (大同大學)

2 Content
Introduction
Value Iteration for MDP
Belief States & Infinite-State MDP
Value Function of POMDP
The PWLC Property of Value Function

3 Introduction
Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

4 Definition: MDP
A Markov decision process is a tuple $\langle S, A, T, R \rangle$, where
S is a finite set of states of the world,
A is a finite set of actions,
$T: S \times A \to \Pi(S)$ is the state-transition function, and
$R: S \times A \to \mathbb{R}$ is the reward function.

5 Complete Observability
Solution procedures for MDPs give values or policies for each state. Using these solutions requires that the agent can detect the state it is currently in with complete reliability. Such a process is therefore called a CO-MDP (completely observable MDP).

6 Partial Observability
Instead of directly measuring the current state, the agent makes an observation to get a hint about what state it is in. How does it get this hint (i.e., guess the state)? By taking an action and then making an observation. The observation is probabilistic, i.e., it provides only a hint, so the 'state' will be defined in a probabilistic sense.

7 Observation Model
$\Omega$: a finite set of observations the agent can experience of its world.
$O(s', a, o)$: the probability of getting observation $o$ given that the agent took action $a$ and landed in state $s'$.

8 Definition: POMDP
A POMDP is a tuple $\langle S, A, T, R, \Omega, O \rangle$, where $\langle S, A, T, R \rangle$ describes an MDP, $\Omega$ is the finite set of observations, and $O: S \times A \to \Pi(\Omega)$ is the observation function. How can we find an optimal policy in such an environment?
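Continuing the same illustrative convention, the POMDP tuple adds the observation set and the observation function; again only a sketch, with O stored so that O[a, s', o] follows the slide's reading "probability of o given action a and landing state s'".

```python
import numpy as np

# Hypothetical container for a POMDP <S, A, T, R, Omega, O>;
# names and array layout are illustrative assumptions.
class POMDP:
    def __init__(self, T, R, O, gamma=0.95):
        self.T = T                      # T[s, a, s'] = Pr(s' | s, a)
        self.R = R                      # R[s, a]     = immediate reward
        self.O = O                      # O[a, s', o] = Pr(o | a, s')
        self.gamma = gamma
        self.n_states, self.n_actions, _ = T.shape
        self.n_obs = O.shape[-1]

    def check(self):
        assert np.allclose(self.T.sum(axis=2), 1.0)     # valid transition distributions
        assert np.allclose(self.O.sum(axis=2), 1.0)     # valid observation distributions
```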

9 Value Iteration for MDP
Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

10 Acting Optimally
Finite-Horizon Model: maximize the expected total reward over the next k steps.
Infinite-Horizon Discounted Model: maximize the expected discounted total reward.
Is there any difference in the nature of their optimal policies?

11 Stationary vs. Non-Stationary Policies
Finite-Horizon Model: the optimal policy depends on the number of time steps remaining, so use a non-stationary policy.
Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining, so use a stationary policy.

12 Stationary vs. Non-Stationary Policies
Finite-Horizon Model: the optimal policy depends on the number of remaining time steps, so use a non-stationary policy.
Infinite-Horizon Discounted Model: the optimal policy is independent of the number of remaining time steps, so use a stationary policy.

13 Value Functions
Finite-Horizon Model: non-stationary policy.
Infinite-Horizon Discounted Model: stationary policy.
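The two value functions being contrasted, written out in their standard form (the slide's own equations are images that did not survive the transcript):

```latex
% Finite-horizon value of a (non-stationary) policy \pi = (\pi_t, \dots, \pi_1):
V_t^{\pi}(s) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{t-1} r_k \;\middle|\; s_0 = s,\ \pi\right]

% Infinite-horizon discounted value of a stationary policy \pi:
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k} r_k \;\middle|\; s_0 = s,\ \pi\right]
```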

14 Optimal Policies
Finite-Horizon Model: non-stationary policy.
Infinite-Horizon Discounted Model: stationary policy.

15 Optimal Policies
Finite-Horizon Model: non-stationary policy.
Infinite-Horizon Discounted Model: stationary policy.
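The corresponding greedy optimal policies, in the standard form the missing equations presumably showed:

```latex
% Finite-horizon: the chosen action depends on the number of steps remaining, t.
\pi_t^{*}(s) \;=\; \arg\max_{a}\Big[ R(s,a) + \sum_{s'} T(s,a,s')\, V_{t-1}^{*}(s') \Big]

% Infinite-horizon discounted: a single stationary rule suffices.
\pi^{*}(s) \;=\; \arg\max_{a}\Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \Big]
```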

16 Optimal Policies (Finite-Horizon Model, non-stationary policy)
To find an optimal policy, do we need to spend infinite time? What happens as $t \to \infty$? What if $V_t(s) \approx V_{t-1}(s)$ for all $s$? How large must $t$ be before $V_t(s) \approx V_{t-1}(s)$ for all $s$?

17 Value Iteration
The MDP has a finite number of states.
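The value-iteration procedure itself appears on the slide only as an image; here is a minimal sketch for a finite-state MDP, using the array conventions from the earlier sketch and a stopping threshold of my own choosing.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Value iteration for a finite-state MDP.
    T[s, a, s'] = Pr(s' | s, a); R[s, a] = immediate reward."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * np.einsum("san,n->sa", T, V)
        V_new = Q.max(axis=1)
        # Stop once V_t(s) is (approximately) V_{t-1}(s) for every s.
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)      # value function and greedy policy
        V = V_new
```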

18 Belief States & Infinite-State MDP
Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

19 POMDP Framework
[Diagram: the World (an MDP) receives actions from the Agent and returns observations; inside the Agent, the state estimator (SE) converts each observation into a belief state b, on which the action choice is based.]

20 Belief States
A belief state b is a probability distribution over S: b(s) is the probability that the world is currently in state s. There is an uncountably infinite number of belief states.

21 State Space
There is an uncountably infinite number of belief states. [Figures: the belief space of a 2-state POMDP is the interval [0, 1]; that of a 3-state POMDP is a triangle, the 2-simplex.]

22 State Estimation
There is an uncountably infinite number of belief states. State estimation: given $b_t$, $a_t$ and $o_{t+1}$, what is $b_{t+1}$?

23 State Estimation
$b_{t+1}(s') = \dfrac{O(s', a_t, o_{t+1}) \sum_{s} T(s, a_t, s')\, b_t(s)}{\Pr(o_{t+1} \mid a_t, b_t)}$, where the denominator is a normalization factor.

24 State Estimation Remember these. Normalization Factor

25 State Estimation
The update is linear w.r.t. $b_t$; the denominator is just a normalization factor.
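A sketch of the state estimator implementing this update, with the same conventions as the earlier POMDP sketch; it follows the standard derivation, since the slides' equation images are missing.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """SE(b, a, o): the new belief after taking action a in belief b and observing o.
    b[s] = Pr(state = s); T[s, a, s'] = Pr(s' | s, a); O[a, s', o] = Pr(o | a, s')."""
    # Predicted next-state distribution: sum_s T(s, a, s') b(s); linear in b.
    predicted = b @ T[:, a, :]
    # Weight each landing state s' by the likelihood of the observation o.
    unnormalized = O[a, :, o] * predicted
    # The denominator Pr(o | a, b) is the normalization factor on the slide.
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation o has zero probability under (b, a).")
    return unnormalized / norm
```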

26 State Transition Function
[Figure: action a takes belief b to a new belief b'.] It is linear w.r.t. $b_t$.

27 State Transition Function
Suppose that the state estimator gives $b' = \mathrm{SE}(b, a, o)$. It is linear w.r.t. $b_t$.
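Written out, the belief-space transition function is the standard one (SE denotes the state estimator above; this is a reconstruction, since the slide's equation is an image):

```latex
\tau(b, a, b') \;=\; \Pr(b' \mid b, a)
  \;=\; \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid b, a),
\qquad
\Pr(b' \mid b, a, o) \;=\;
\begin{cases}
1 & \text{if } \mathrm{SE}(b, a, o) = b',\\
0 & \text{otherwise.}
\end{cases}
```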

28 POMDP = Infinite-State MDP
A POMDP is an MDP with tuple $\langle B, A, \tau, \rho \rangle$, where
B is the set of belief states,
A is the finite set of actions (the same as in the original MDP),
$\tau: B \times A \to \Pi(B)$ is the state-transition function, and
$\rho: B \times A \to \mathbb{R}$ is the reward function.
What is the reward function?

29 Reward Function
$\rho(b, a) = \sum_{s} b(s)\, R(s, a)$, where R is the reward function of the original MDP. Good news: it is linear in b.

30 Value Function of POMDP
Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

31 Value Function over Belief Space
Consider a 2-state POMDP. [Figure: a value function V(b) plotted over the belief b ∈ [0, 1].] How can we obtain the value function over belief space? Can we use the table-based method?

32 Finding an Optimal Policy
POMDP = Infinite-State MDP. The general MDP method: determine the value function, then perform policy improvement. Value functions: the state value function and the action value function.

33 Review: Value Iteration
Value iteration is based on the finite-horizon value function; it finds $V_t$ on each iteration. What is $V_t$ here?

34 The and Immediate Reward

35 Consider a 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). [Figure: the immediate-reward lines of a1 and a2 plotted over the belief b ∈ [0, 1].]

36 Horizon-1 Policy Trees
Consider a 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). [Figure: the horizon-1 value over b ∈ [0, 1]; the partition P1 marks where a1 or a2 is the better single action.]

37 Horizon-1 Policy Trees
The horizon-1 value function is piecewise linear and convex (PWLC). Consider the same 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). [Figure: as above, with partition P1.]
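With one step to go there is no future term, so the horizon-1 value of a belief is just the best expected immediate reward, i.e. the upper surface of one line per action. A small sketch (the reward numbers are made up for illustration):

```python
import numpy as np

def horizon1_value(b, R):
    """V_1(b) = max_a sum_s b(s) R(s, a): the maximum of |A| linear functions of b."""
    return float(np.max(b @ R))     # b: shape (|S|,), R: shape (|S|, |A|)

# Illustrative 2-state, 2-action rewards (hypothetical numbers).
R = np.array([[1.0, 0.0],           # rewards in state s1 for actions a1, a2
              [0.0, 1.5]])          # rewards in state s2 for actions a1, a2
for p in np.linspace(0.0, 1.0, 5):
    b = np.array([1.0 - p, p])      # belief: Pr(s1) = 1 - p, Pr(s2) = p
    print(f"b(s2) = {p:.2f}   V1(b) = {horizon1_value(b, R):.3f}")
```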

38 How about a 3-state POMDP and beyond? It is still PWLC. [Figure: the value surface over the triangular belief space of a 3-state POMDP with corner states s1 and s2 labeled.] What is the policy?

39 How about a 3-state POMDP and beyond? What is the policy?

40 The PWLC
A piecewise linear function consists of linear (hyperplane) segments. Linear function: $f(b) = \alpha \cdot b$. The $k$th linear segment is described by its $\alpha$-vector $\alpha^k$, so each segment can be represented as $\alpha^k \cdot b$.

41 The PWLC
A piecewise linear function consists of linear (hyperplane) segments. Linear function: $f(b) = \alpha \cdot b$. The $k$th linear segment is described by its $\alpha$-vector $\alpha^k$, so each segment can be represented as $\alpha^k \cdot b$.
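A PWLC value function can therefore be stored as a finite set of alpha-vectors and evaluated by taking the largest of the corresponding dot products; a minimal sketch (the class name is my own):

```python
import numpy as np

class PWLCValueFunction:
    """Represents V(b) = max_k alpha_k . b for a finite set of alpha-vectors."""
    def __init__(self, alpha_vectors):
        # One row per linear segment (hyperplane), one column per state.
        self.alphas = np.atleast_2d(np.asarray(alpha_vectors, dtype=float))

    def value(self, b):
        return float(np.max(self.alphas @ b))

    def best_segment(self, b):
        # Index k of the alpha-vector that attains the maximum at belief b.
        return int(np.argmax(self.alphas @ b))
```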

42 Annotations on the backup equation: the immediate reward; the value of observation o for doing action a on the current belief state b; and the probability of observation o for doing action a on the current belief state b.

43 Is the resulting function PWLC? Yes, it is, but I will defer the proof. (Same annotations as above: the immediate reward; the value of observation o for doing action a on the current belief state b; the probability of observation o for doing action a on the current belief state b.)
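Assembling the three annotated pieces gives the value-iteration backup over beliefs; this is the standard form, reconstructed here because the slide's equation is an image:

```latex
V_t(b) \;=\; \max_{a \in A}\Big[
  \underbrace{\textstyle\sum_{s} b(s)\, R(s, a)}_{\text{immediate reward}}
  \;+\; \gamma \sum_{o \in \Omega}
  \underbrace{\Pr(o \mid b, a)}_{\substack{\text{prob. of } o \text{ after}\\ \text{doing } a \text{ in } b}}\;
  \underbrace{V_{t-1}\big(\mathrm{SE}(b, a, o)\big)}_{\substack{\text{value of } o \text{ after}\\ \text{doing } a \text{ in } b}}
\Big]
```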

44 The and

45 Compute. [Figure: starting from belief b and taking action a1, each observation o1, o2, o3 leads to a transformed belief b'; value lines for a1 and a2 are shown over b ∈ [0, 1].]

46 What action will you take if the observation is oi after a1 is taken? [Figure: as above.]

47 Consider an individual observation o after action a is taken. Define:

48 Transformed value function. [Figure: the value lines for a1 and a2 and their transformed versions.]

49 [Figure: the transformed value function for action a1.]

50 [Figure: the transformed value functions for action a1 under observations o1, o2, and o3.]

51 Horizon-2 Tree for Action 1
[Figure: construction based on the horizon-1 partition P1.]

52 Horizon-2 Tree for Action 1

53 The and

54 The and a1 a2

55 Horizon-2 Policy Tree
[Figure: the horizon-2 partition P2 over the horizon-1 partition P1, with action choices a1, a2 and observation branches o1, o2, o3.] Can you figure out how to determine the value function for horizon 3 from the above discussion?
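One way to read the horizon-2 construction, and by repetition horizon 3 and beyond: every new alpha-vector is built by fixing an action and choosing, for each observation, one alpha-vector from the previous horizon. A sketch of this exact enumeration backup (array conventions as before; it is exponential in the number of observations, so it is for illustration only, and a real solver would also prune dominated vectors):

```python
import itertools
import numpy as np

def pomdp_backup(prev_alphas, T, R, O, gamma=0.95):
    """Build the horizon-t alpha-vectors from the horizon-(t-1) set by enumerating,
    for each action, every assignment of a previous alpha-vector to each observation.
    T[s, a, s'] = Pr(s' | s, a); R[s, a]; O[a, s', o] = Pr(o | a, s')."""
    n_states, n_actions, _ = T.shape
    n_obs = O.shape[-1]
    prev = np.atleast_2d(np.asarray(prev_alphas, dtype=float))      # shape (K, |S|)
    new_alphas = []
    for a in range(n_actions):
        # g[k, o, s] = gamma * sum_s' T(s, a, s') O(a, s', o) prev[k, s']
        g = gamma * np.einsum("sn,no,kn->kos", T[:, a, :], O[a], prev)
        for choice in itertools.product(range(prev.shape[0]), repeat=n_obs):
            # One previous vector chosen per observation; sum their contributions.
            alpha = R[:, a] + sum(g[choice[o], o, :] for o in range(n_obs))
            new_alphas.append(alpha)
    return np.array(new_alphas)     # dominated vectors should be pruned in practice
```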

56 The and a1 a1 a2 a2

57 The and

58 The and How for and ?

59 Horizon-3 Policy Tree
[Figure: observation branches o1, o2, o3 over the partitions P1, P2, P3.]

60 The PWLC Property of Value Function
Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

61 Value Function for POMDP

62 Value Function for POMDP
Let

63 Value Function for POMDP
Let Let

64 Theorem: $V_t$ is PWLC.

65 Proof (by induction)
We already know the base case, that $V_1$ is PWLC, is true. Assume $V_{t-1}$ is PWLC. We then show that $V_t$ must be PWLC.
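A compressed sketch of the induction step in the usual notation (the slide's own algebra is in the missing images): assuming $V_{t-1}(b) = \max_k \alpha_k \cdot b$, each observation term of the backup is itself a maximum of functions linear in $b$, so $V_t$ is again a pointwise maximum of finitely many linear functions, hence PWLC.

```latex
% For each action a, observation o, and previous vector \alpha_k define
\alpha_k^{a,o}(s) \;=\; \sum_{s'} \alpha_k(s')\, O(s', a, o)\, T(s, a, s').
\\[4pt]
% Then the observation term of the backup is a maximum of linear functions of b:
\Pr(o \mid b, a)\; V_{t-1}\!\big(\mathrm{SE}(b, a, o)\big) \;=\; \max_{k}\; \alpha_k^{a,o} \cdot b,
\\[4pt]
% so V_t is a pointwise maximum of finitely many linear functions of b, i.e. PWLC:
V_t(b) \;=\; \max_{a}\Big[\, b \cdot R(\cdot, a) \;+\; \gamma \sum_{o} \max_{k}\; \alpha_k^{a,o}\cdot b \,\Big].
```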

66 Proof From the assumption, we have

67 Proof Let

68 Proof Let

69 is PWLC. Proof Let

