Published by Jasmin Gordon. Modified about 1 year ago.

1
Fast approximate POMDP planning: Overcoming the curse of history!
Joelle Pineau, Geoff Gordon and Sebastian Thrun, CMU
Point-based value iteration: an anytime algorithm for POMDPs
Workshop on Advances in Machine Learning, June 2003

2
Why use a POMDP?
POMDPs provide a rich framework for sequential decision-making, which can model:
– varying rewards across actions and goals
– actions with random effects
– uncertainty in the state of the world

3
Existing applications of POMDPs
– Maintenance scheduling
  » Puterman, 1994
– Robot navigation
  » Koenig & Simmons, 1995; Roy & Thrun, 1999
– Helicopter control
  » Bagnell & Schneider, 2001; Ng et al., 2002
– Dialogue modeling
  » Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000
– Preference elicitation
  » Boutilier, 2002

4
POMDP Model
A POMDP is an n-tuple { S, A, Ω, T, O, R }:
S = state set
A = action set
Ω = observation set
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
What goes on: states s_{t-1}, s_t and actions a_{t-1}, a_t
What we see: observations o_{t-1}, o_t
What we infer: beliefs b_{t-1}, b_t
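The belief in the "what we infer" row is obtained from the previous belief, the action taken and the observation received. A minimal Python sketch of that Bayes update follows. The nested-list layout T[s][a][s'] and O[s'][a][o], the function name belief_update, and the reading of O as conditioned on the resulting state are assumptions made for illustration; the slides do not specify them.

```python
def belief_update(b, a, o, T, O):
    """Return b'(s') proportional to O[s'][a][o] * sum_s T[s][a][s'] * b[s]."""
    n = len(b)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(b[s] * T[s][a][s2] for s in range(n)) for s2 in range(n)]
    # Correction step: weight each successor state by the observation likelihood.
    unnormalized = [O[s2][a][o] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return [u / total for u in unnormalized]


# Hypothetical two-state, one-action model: the state never changes and the
# observation matches the true state with probability 0.85.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]        # T[s][a][s']
O = [[[0.85, 0.15]], [[0.15, 0.85]]]    # O[s'][a][o]
b_next = belief_update([0.5, 0.5], a=0, o=0, T=T, O=O)
print(b_next)  # approximately [0.85, 0.15]
```

Starting from a uniform belief, a single observation pointing at the first state moves the belief to roughly 0.85 on that state, which is the observation accuracy assumed in this toy model.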

5
Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1
– E.g. let S = {s_1, s_2}
[Figure: belief space along the axis P(s_1), from 0 to 1]

6
Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1
– E.g. let S = {s_1, s_2, s_3}
[Figure: belief space over the axes P(s_1) and P(s_2), each from 0 to 1]

7
Understanding the belief state
A belief is a probability distribution over states, where Dim(B) = |S|-1
– E.g. let S = {s_1, s_2, s_3, s_4}
[Figure: belief space over the axes P(s_1), P(s_2) and P(s_3)]

8
The first curse of POMDP planning
The curse of dimensionality:
– dimension of planning problem = # of states
– related to the MDP curse of dimensionality

9
POMDP value functions
V(b) = expected total discounted future reward starting from b
Represent V as the upper surface of a set of hyper-planes.
V is piecewise-linear and convex
Backup operator T: V → TV
[Figure: V(b) against P(s_1), with a belief b marked]
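Representing V as the upper surface of a set of hyper-planes means that evaluating V at a belief is a maximum over dot products. A minimal Python sketch, assuming each hyper-plane is stored as a list with one coefficient per state (often called an alpha vector); the names value, best_alpha_index and the three example vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def value(b, alphas):
    """V(b): height of the upper surface of the hyper-planes at belief b."""
    return max(dot(alpha, b) for alpha in alphas)


def best_alpha_index(b, alphas):
    """Index of the hyper-plane that attains the maximum at belief b."""
    return max(range(len(alphas)), key=lambda i: dot(alphas[i], b))


# Hypothetical two-state example; a belief is [P(s_1), P(s_2)].
alphas = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
print(value([0.9, 0.1], alphas))             # 0.9, attained by the first hyper-plane
print(best_alpha_index([0.5, 0.5], alphas))  # 2, the flat hyper-plane wins in the middle
```

Because V is a maximum of linear functions, it is convex: the value at the midpoint of two beliefs never exceeds the average of their values.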

10
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration   # hyper-planes
0           1
[Figure: V_0(b) against P(s_1), with a belief b marked]

11
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Table: Iteration vs. # hyper-planes]
[Figure: V_1(b) against P(s_1), with a belief b marked]

12
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Table: Iteration vs. # hyper-planes]
[Figure: V_2(b) against P(s_1), with a belief b marked]

13
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Table: Iteration vs. # hyper-planes]
[Figure: V_2(b) against P(s_1), with a belief b marked]

14
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
[Table: Iteration vs. # hyper-planes; surviving entry: ,348,907]
[Figure: V_2(b) against P(s_1), with a belief b marked]
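The growth in the hyper-plane count can be reproduced with a short calculation. The sketch below assumes the standard worst-case count for exact value iteration before pruning, in which an iteration starting from N hyper-planes produces |A| · N^|Ω| of them; that recurrence and the function name hyperplane_counts are not stated on the slide and are used here as assumptions.

```python
def hyperplane_counts(num_actions, num_observations, iterations, initial=1):
    """Worst-case number of hyper-planes per iteration, before any pruning."""
    counts = [initial]
    for _ in range(iterations):
        # Each action pairs with one choice of previous hyper-plane per observation.
        counts.append(num_actions * counts[-1] ** num_observations)
    return counts


# The simple problem above: |A| = 3, |Ω| = 2, starting from one hyper-plane.
print(hyperplane_counts(3, 2, 4))  # [1, 3, 27, 2187, 14348907]
```

Under this assumption the fourth iteration reaches 14,348,907 hyper-planes, which is consistent with the digits that survive in the table.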

15
Exact value iteration for POMDPs
Simple problem: |S|=2, |A|=3, |Ω|=2
Many hyper-planes can be pruned away
[Table: Iteration vs. # hyper-planes]
[Figure: V_2(b) against P(s_1), with a belief b marked]

16
Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8
[Table: Iteration vs. # hyper-planes, ending in "?????" and "…"]
Not for this problem!

17
Certainly not for this problem!
[Figure: map with the locations Physiotherapy, Patient room and Robot home]
|S|=576, |A|=19, |O|=17
State Features: { RobotLocation, ReminderGoal, UserLocation, UserMotionGoal, UserStatus, UserSpeechGoal }

18
The second curse of POMDP planning
The curse of dimensionality:
– the dimension of each hyper-plane = # of states
The curse of history:
– the number of hyper-planes grows exponentially with the planning horizon

19
The second curse of POMDP planning
The curse of dimensionality:
– the dimension of each hyper-plane = # of states
The curse of history:
– the number of hyper-planes grows exponentially with the planning horizon
Complexity of POMDP value iteration:
[Equation not recovered; its terms were labelled "dimensionality" and "history"]

20
Possible approximation approaches
Ignore the belief:
- overcomes both curses
- very fast
- performs poorly in high entropy beliefs
[Littman et al., 1995]
Discretize the belief:
- overcomes the curse of history (sort of)
- scales exponentially with # states
[Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]
Compress the belief:
- overcomes the curse of dimensionality
[Poupart & Boutilier, 2002; Roy & Gordon, 2002]
Plan for trajectories:
- can diminish both curses
- requires restricted policy class
- local minimum, small gradients
[Baxter & Bartlett, 2000; Ng & Jordan, 2002]
[Figure: states s_0, s_1, s_2]
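As an illustration of the "ignore the belief" family, the sketch below follows the QMDP idea associated with the Littman et al. citation: solve the underlying fully observable MDP, then pick the action whose Q-values are best on average under the current belief. The nested-list layout T[s][a][s'] and R[s][a], the fixed iteration count and the function names are assumptions for illustration, not details taken from the slides.

```python
def qmdp_q_values(T, R, gamma, iterations=200):
    """Q-values of the underlying MDP, computed by plain value iteration."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q


def qmdp_action(b, Q):
    """Action maximising the belief-weighted Q-value, sum_s b[s] * Q[s][a]."""
    n_actions = len(Q[0])
    scores = [sum(b[s] * Q[s][a] for s in range(len(b))) for a in range(n_actions)]
    return max(range(n_actions), key=scores.__getitem__)


# Hypothetical two-state model: the state never changes, and action i pays 1 in state i.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]  # T[s][a][s']
R = [[1.0, 0.0], [0.0, 1.0]]                               # R[s][a]
Q = qmdp_q_values(T, R, gamma=0.5)
print(qmdp_action([0.9, 0.1], Q))  # 0
```

No information-gathering value is ever computed here, which matches the slide's remark that this family performs poorly in high entropy beliefs.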

21
A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
[Figure: V(b) against P(s_1), with belief points b_1, b_0, b_2]

22
A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
– Plan for those belief points only
[Figure: V(b) against P(s_1), with belief points b_1, b_0, b_2]

23
A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
  » Focus on reachable beliefs
– Plan for those belief points only
[Figure: V(b) against P(s_1), with belief points b_1, b_0, b_2 linked by (a,o) transitions]

24
A new algorithm: Point-based value iteration
Main idea:
– Select a small set of belief points
  » Focus on reachable beliefs
– Plan for those belief points only
  » Learn value and its gradient
[Figure: V(b) against P(s_1), with belief points b_1, b_0, b_2 linked by (a,o) transitions]

25
Point-based value update
[Figure: V(b) against P(s_1), with belief points b_1, b_0, b_2]

26
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
[Figure: V_n(b) against P(s_1), with belief points b_1, b_0, b_2]

27
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
[Figure: V_n(b) against P(s_1), with a belief b marked]

28
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
[Figure: V_n(b) against P(s_1), with b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

29
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
[Figure: V_n(b) against P(s_1), with b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

30
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
– Sum over observations:
[Figure: V_n(b) against P(s_1), with b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

31
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
– Sum over observations:
[Figure: V_n(b) against P(s_1), with b and the labels b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

32
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
– Sum over observations:
[Figure: V_{n+1}(b) against P(s_1), with b and the labels b^{a1}, b^{a2}]

33
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
– Sum over observations:
– Max over actions:
[Figure: V_{n+1}(b) against P(s_1), with b and the labels b^{a1}, b^{a2}]

34
Point-based value update
Initialize the value function (…and skip ahead a few iterations)
For each b ∈ B:
– For each (a,o): project forward b → b^{a,o} and find best value:
– Sum over observations:
– Max over actions:
[Figure: V_{n+1}(b) against P(s_1), with belief points b_1, b_2, b_0]
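The three steps of the update (project for each (a,o), sum over observations, max over actions) can be sketched in Python as follows. This is a minimal sketch of the point-based backup as it is commonly written, not a transcription of the slide's equations, which did not survive. The nested-list layout T[s][a][s'], O[s'][a][o], R[s][a], the discount gamma and all function names are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def point_based_backup(b, alphas, T, O, R, gamma):
    """Return the single hyper-plane that is best at belief b after one backup."""
    n_states, n_actions, n_obs = len(R), len(R[0]), len(O[0][0])
    best_vec, best_val = None, float("-inf")
    for a in range(n_actions):
        vec = [float(R[s][a]) for s in range(n_states)]
        for o in range(n_obs):
            # Projection: g[s] = sum_s' T[s][a][s'] * O[s'][a][o] * alpha[s'].
            projected = [
                [sum(T[s][a][s2] * O[s2][a][o] * alpha[s2] for s2 in range(n_states))
                 for s in range(n_states)]
                for alpha in alphas
            ]
            # Keep only the projection with the best value at this belief.
            g_best = max(projected, key=lambda g: dot(g, b))
            # Sum over observations.
            vec = [v + gamma * g for v, g in zip(vec, g_best)]
        # Max over actions.
        val = dot(vec, b)
        if val > best_val:
            best_vec, best_val = vec, val
    return best_vec


def pbvi_update(B, alphas, T, O, R, gamma):
    """One value update: at most one new hyper-plane per belief point in B."""
    return [point_based_backup(b, alphas, T, O, R, gamma) for b in B]


# Hypothetical two-state model: static state, uninformative observations,
# and action i pays 1 in state i.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
O = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.9, 0.1], [0.1, 0.9]]
first = pbvi_update(B, [[0.0, 0.0]], T, O, R, gamma=0.5)
second = pbvi_update(B, first, T, O, R, gamma=0.5)
print(second)  # [[1.5, 0.0], [0.0, 1.5]]
```

The returned set never holds more hyper-planes than there are belief points, which is what keeps the cost of each update polynomial.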

35
Complexity of value update
                  Exact update      Point-based update
I - Projection    S^2 A Ω Γ_n       S^2 A Ω B
II - Sum          S A Γ_n^Ω         S A Ω B^2
III - Max         S A Γ_n^Ω         S A B
where:
S = # states
A = # actions
Ω = # observations
Γ_n = # solution vectors at iteration n
B = # belief points

36
A bound on the approximation error
Bound the error of the point-based backup operator.
The bound depends on how densely we sample belief points.
– Let Δ be the set of reachable beliefs.
– Let B be the set of belief points.
Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm, ε_n = ||V_n^B - V_n^*||, is bounded by:
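The formula itself is not reproduced in this transcript. In the published PBVI paper by the same authors, the bound takes the form (R_max - R_min) · ε_B / (1 - γ)^2, where ε_B is the largest L1 distance from any reachable belief to its nearest point in B. The Python sketch below computes both quantities under that form, approximating the reachable set by a finite sample; the function names, the sampling approximation and the example values are assumptions for illustration.

```python
def l1_distance(u, v):
    """L1 distance between two beliefs."""
    return sum(abs(x - y) for x, y in zip(u, v))


def belief_set_density(B, reachable_sample):
    """eps_B over a finite sample: worst-case distance to the nearest point in B."""
    return max(min(l1_distance(b, bp) for b in B) for bp in reachable_sample)


def pbvi_error_bound(eps_B, r_max, r_min, gamma):
    """Bound of the form (R_max - R_min) * eps_B / (1 - gamma)^2."""
    return (r_max - r_min) * eps_B / (1.0 - gamma) ** 2


# Hypothetical two-state example: B holds only the two corner beliefs.
B = [[1.0, 0.0], [0.0, 1.0]]
sample = [[0.5, 0.5], [1.0, 0.0]]
eps_B = belief_set_density(B, sample)
print(eps_B)                                                      # 1.0
print(pbvi_error_bound(eps_B, r_max=1.0, r_min=0.0, gamma=0.5))   # 4.0
```

Adding the uniform belief to B drives ε_B on this sample to zero, which illustrates why denser sampling tightens the bound.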

37
Experimental results: Lasertag domain
State space = RobotPosition × OpponentPosition
Observable:
– RobotPosition: always
– OpponentPosition: only if same as Robot
Action space = {North, South, East, West, Tag}
Opponent strategy: Move away from robot w/ Pr=0.8
|S|=870, |A|=5, |Ω|=30

38
Performance of PBVI on Lasertag domain
[Figure caption: Opponent tagged 70% of trials]
[Figure caption: Opponent tagged 17% of trials]

39
Performance on well-known POMDPs
Domains: Maze33 (|S|=36, |A|=5, |Ω|=17), Hallway (|S|=60, |A|=5, |Ω|=20), Hallway2 (|S|=92, |A|=5, |Ω|=17)
Methods: QMDP, Grid, PBUA, PBVI
Measures: Reward, Time(s), B, %Goal
[Table: most entries not recovered; surviving entries are 0.19, 0.51 and 1.44 after the three Time(s) labels, 47 after a %Goal label, and several "n.v" cells]

40
Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
[Figure: belief b on the P(s_1) axis with transitions (a1,o1), (a1,o2), (a2,o1), (a2,o2) to successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

41
Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
How can we avoid including all reachable beliefs?
– Reachability analysis considers all actions, but samples the observation stochastically.
[Figure: belief b on the P(s_1) axis with successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}; only the transitions (a1,o2) and (a2,o1) are drawn]

42
Selecting good belief points
What can we learn from policy search methods?
– Focus on reachable beliefs.
How can we avoid including all reachable beliefs?
– Reachability analysis considers all actions, but samples the observation stochastically.
What can we learn from our error bound?
– Select widely-spaced beliefs, rather than near-by beliefs.
[Figure: belief b on the P(s_1) axis with successors b^{a1,o2} and b^{a2,o1}, reached by the transitions (a1,o2) and (a2,o1)]
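The three points above suggest a simple expansion step: from each belief already in the set, try every action, sample one observation per action, and keep the successor that lies farthest from the current set. The Python sketch below implements that reading. The L1 distance, the nested-list layout T[s][a][s'] and O[s'][a][o], and the function names are assumptions; the slides do not give the exact procedure.

```python
import random


def l1_distance(u, v):
    """L1 distance between two beliefs."""
    return sum(abs(x - y) for x, y in zip(u, v))


def simulate_step(b, a, T, O, rng):
    """Take action a from belief b, sample an observation, return the new belief."""
    n = len(b)
    predicted = [sum(b[s] * T[s][a][s2] for s in range(n)) for s2 in range(n)]
    n_obs = len(O[0][a])
    obs_probs = [sum(predicted[s2] * O[s2][a][o] for s2 in range(n)) for o in range(n_obs)]
    o = rng.choices(range(n_obs), weights=obs_probs)[0]
    unnormalized = [O[s2][a][o] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def expand_beliefs(B, T, O, rng):
    """Add at most one new belief per existing point, so the set at most doubles."""
    n_actions = len(T[0])
    expanded = [list(b) for b in B]
    for b in B:
        candidates = [simulate_step(b, a, T, O, rng) for a in range(n_actions)]
        # Distance from each candidate to its nearest neighbour already in the set.
        gaps = [min(l1_distance(c, p) for p in expanded) for c in candidates]
        best = max(range(n_actions), key=gaps.__getitem__)
        if gaps[best] > 0.0:
            expanded.append(candidates[best])
    return expanded


# Hypothetical two-state model: action 0 keeps the state, action 1 swaps it,
# and observations carry no information.
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # T[s][a][s']
O = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]  # O[s'][a][o]
expanded = expand_beliefs([[0.9, 0.1]], T, O, random.Random(0))
print(expanded)
```

In this toy model the swap action leads to a belief near [0.1, 0.9], far from the starting point, so that successor is the one added.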

43
Validation of the belief expansion heuristic
Hallway domain: |S|=60, |A|=5, |Ω|=20

44
Validation of the belief expansion heuristic
Tag domain: |S|=870, |A|=5, |Ω|=30

45
The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size every time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.

46
The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size every time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results:
– 13 phases: |B|=1334
– ran out of time!

47
The anytime PBVI algorithm
Alternate between:
– Growing the set of belief points (e.g. B doubles in size every time)
– Planning for those belief points
Terminate when you run out of time or have a good policy.
Lasertag results:
– 13 phases: |B|=1334
– ran out of time!
Hallway2 results:
– 8 phases: |B|=95
– found good policy.
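The alternation between growing the belief set and planning can be sketched as a small driver loop. In the Python sketch below, expand, plan and good_enough are caller-supplied placeholders standing in for the belief expansion step, the point-based value updates and a policy-quality test; these names, the time budget parameter and the phase cap are all hypothetical.

```python
import time


def anytime_pbvi(initial_beliefs, expand, plan, time_budget_s,
                 good_enough=None, max_phases=50):
    """Alternate between expanding the belief set and planning over it.

    Stops when the time budget is spent, the policy is judged good enough,
    or max_phases is reached. Returns (beliefs, solution, phases).
    """
    beliefs = list(initial_beliefs)
    solution = plan(beliefs, None)           # phase 1: plan for the initial beliefs
    phases = 1
    deadline = time.monotonic() + time_budget_s
    while phases < max_phases and time.monotonic() < deadline:
        if good_enough is not None and good_enough(solution):
            break
        beliefs = expand(beliefs)            # grow the set of belief points
        solution = plan(beliefs, solution)   # plan for those belief points
        phases += 1
    return beliefs, solution, phases


# Stub run: the belief set doubles each phase and the "solution" is just its size.
beliefs, solution, phases = anytime_pbvi(
    [0],
    expand=lambda B: B + B,
    plan=lambda B, previous: len(B),
    time_budget_s=60.0,
    good_enough=lambda sol: sol >= 8,
)
print(len(beliefs), phases)  # 8 4
```

Because a usable solution exists after every phase, the caller can interrupt the loop at any point and still act on the most recent one.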

48
Summary
POMDPs suffer from the curse of history
  » # of beliefs grows exponentially with the planning horizon
PBVI addresses the curse of history by limiting planning to a small set of likely beliefs.
Strengths of PBVI include:
  » anytime algorithm;
  » polynomial-time value updates;
  » bounded approximation error;
  » empirical results showing we can solve problems up to 870 states.

49
Recent work
Current hurdle to solving even larger POMDPs: PBVI complexity is O(S^2 A Ω B + S A Ω B^2)
– Addressing S^2:
  » Combine PBVI with belief compression techniques.
  But sparse transition matrices mean: S^2 → S
– Addressing B^2:
  » Use ball-trees to structure belief points.
  » Find better belief selection heuristics.
