1 C.V. Education – Ben-Gurion University (Israel): B.Sc. 1996-1999, M.Sc. 2000-2002, Ph.D. 2003-2007. Work: after the B.Sc. – Software Engineer at Microsoft, Commerce Server group, Predictor component; after the M.Sc. – Software Engineer at Mercury (now part of HP), functional testing group.

2 Advances in Point-Based POMDP Solvers Guy Shani Ronen Brafman Solomon E. Shimony

3 Overview Agenda: –Introduce point-based POMDP solvers. –Review recent advances. Structure: –Background – MDPs, POMDPs. –Point-based solvers – belief set selection, value function computation. –Experiments.

4 Markov Decision Process - MDP Model agents in a stochastic environment. State – an encapsulation of all the relevant environment information: –Agent location –Can the agent eat monsters? –Monster locations –Gold coin locations Action – affects the environment: –Moving up, down, left, right Stochastic effects: –Movement can sometimes fail –Monster movements are random Reward – received for achieving goals: –Collecting coins –Eating a monster

5 MDP Formal Definition Markov property – action effects depend only on the current state. An MDP is defined by the tuple ⟨S, A, tr, R⟩: S – state space; A – action set; tr – state transition function, tr(s,a,s') = pr(s'|s,a); R – reward function, R(s,a).
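
As a concrete (hypothetical) illustration of this tuple, the sketch below holds a tabular MDP in arrays; the field names, array shapes, and the discount factor γ are assumptions introduced for the later examples, not part of the talk.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """A finite MDP <S, A, tr, R> with integer-indexed states and actions (sketch)."""
    n_states: int          # |S|
    n_actions: int         # |A|
    tr: np.ndarray         # tr[a, s, s'] = pr(s' | s, a); shape (|A|, |S|, |S|)
    R: np.ndarray          # R[a, s] = reward for executing a in s; shape (|A|, |S|)
    gamma: float = 0.95    # discount factor (used by the discounted-sum criterion)
```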

6 Policies and Value Functions Policy – specifies an action for each state. Optimal policy – maximizes the collected rewards: –Sum: ∑_t r_t –Average: lim_{T→∞} (1/T) ∑_{t<T} r_t –Discounted sum: ∑_t γ^t r_t, 0 < γ < 1 Value function – assigns a value to each state.

7 Value Iteration (Bellman 1957) Dynamic programming method. Value is updated from reward states backwards. Update is known as a backup.

8 Value Iteration (Bellman 1957) Initialize: V_0(s) = 0, n = 0. While V_n has not converged: for each s, apply the Bellman update V_{n+1}(s) = max_a [R(s,a) + γ ∑_{s'} tr(s,a,s') V_n(s')]; then n = n + 1. Known to converge to V* – the optimal value function. π* – the optimal policy – corresponds to the optimal value function.
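
A minimal value-iteration loop over such a tabular MDP might look as follows; this is a sketch assuming the hypothetical `MDP` container above, with an illustrative convergence threshold `tol`.

```python
import numpy as np

def value_iteration(mdp, tol=1e-6):
    """Repeat Bellman backups until V stops changing; returns V* and a greedy policy."""
    V = np.zeros(mdp.n_states)                    # V_0(s) = 0
    while True:
        # Q[a, s] = R(s, a) + gamma * sum_s' tr(s, a, s') * V(s')
        Q = mdp.R + mdp.gamma * (mdp.tr @ V)
        V_next = Q.max(axis=0)                    # Bellman update: V_{n+1}(s) = max_a Q[a, s]
        if np.max(np.abs(V_next - V)) < tol:      # convergence test
            return V_next, Q.argmax(axis=0)       # V*, pi*
        V = V_next
```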

9 Policy Iteration (Howard 1960) Intuition – we care about policies, not about value functions. Changes in the value function may not affect the policy. Expectation-Maximization. Expectation – fix the policy and compute its value. Maximization – change the policy to maximize the values.

10 Partial Observability Real agents cannot directly observe the state. Sensors – provide partial and noisy information about the world.

11 Partially Observable MDP - POMDP The environment is Markovian. The agent cannot directly view the state. Sensors give observations over the current state. Formal POMDP model: –⟨S, A, tr, R⟩ – an MDP (the environment) –Ω – set of possible observations –O(a,s,o) – observation probability given action and state – pr(o|a,s).
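
Extending the earlier MDP sketch with observations gives a container like the one below; again the field names and array shapes are assumptions made for the examples that follow.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """<S, A, tr, R> plus observations (sketch): O[a, s', o] = pr(o | a, s')."""
    n_states: int
    n_actions: int
    n_obs: int             # |Omega|
    tr: np.ndarray         # (|A|, |S|, |S|)
    R: np.ndarray          # (|A|, |S|)
    O: np.ndarray          # (|A|, |S|, |Omega|); observation given action and reached state
    b0: np.ndarray         # initial belief state, shape (|S|,)
    gamma: float = 0.95
```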

12 Value of Information POMDPs capture the value of information. Example – we don’t know where the larger reward is – should we go and read the map? Answer – it depends on: –The difference between the rewards. –The cost of reading the map. –The accuracy of the map. POMDPs take all such considerations into account and provide an optimal policy.

13 Belief States The agent does not directly observe the environment state. Due to noisy and insufficient information, the agent maintains a belief over the current world state. b(s) is the probability of being at state s. τ(b,a,o) – a deterministic function computing the next belief state given the current belief b, action a, and observation o. The agent knows its initial belief state b_0.
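
τ(b, a, o) is just a Bayes-rule update; below is a sketch over the array-based POMDP container assumed above.

```python
import numpy as np

def tau(pomdp, b, a, o):
    """Next belief state after executing a in belief b and observing o (sketch)."""
    # pr(s', o | b, a) = O(a, s', o) * sum_s b(s) * tr(s, a, s')
    predicted = b @ pomdp.tr[a]                  # sum_s b(s) tr(s, a, s'); shape (|S|,)
    unnormalized = pomdp.O[a, :, o] * predicted
    norm = unnormalized.sum()                    # = pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnormalized / norm                   # b'(s') = pr(s' | b, a, o)
```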

14 Value Function (Sondik 1973) A value function V assigns a value to a belief state b. V* – the optimal value function. V*(b) – the expected reward if the agent behaves optimally starting from belief state b. V is traditionally represented as a set of α-vectors: V(b) = max_α α·b (the upper envelope), where α·b = ∑_s α(s)b(s). [Illustration: two α-vectors α_0, α_1 over the belief space spanned by states s_0 and s_1.]
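
Storing the α-vectors as the rows of a matrix makes evaluating the upper envelope a single matrix-vector product; a sketch:

```python
import numpy as np

def value_of(alpha_vectors, b):
    """V(b) = max_alpha alpha . b; alpha_vectors has shape (|V|, |S|), b has shape (|S|,)."""
    return np.max(alpha_vectors @ b)

def best_alpha(alpha_vectors, b):
    """The alpha-vector attaining the upper envelope at b."""
    return alpha_vectors[np.argmax(alpha_vectors @ b)]
```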

15 Exact Value Iteration Each iteration creates a new set of α-vectors, and the number of vectors explodes exponentially. Dominated vectors can be pruned (Littman et al. 1997), but the pruning process is time consuming.

16 Point-Based Backups (Pineau et al. 2001) Bellman update (backup): –V_{n+1}(b) = max_a [r_a·b + γ ∑_o pr(o|b,a) V_n(τ(b,a,o))] Can be written using vector notation: –backup(b) = argmax_a g_{b,a}·b –g_{b,a} = r_a + γ ∑_o argmax_α (g_{α,a,o}·b) –g_{α,a,o}(s) = ∑_{s'} O(a,s',o) tr(s,a,s') α(s') Computes a new α-vector optimal for a specific input belief point b. Known as a point-based backup.
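
The backup formulas translate almost line-for-line into code; below is a sketch using the array conventions assumed earlier (R[a] plays the role of r_a, and alpha_vectors stacks the current α-vectors as rows).

```python
import numpy as np

def point_based_backup(pomdp, alpha_vectors, b):
    """Return the single alpha-vector backup(b) that is optimal at belief b (sketch)."""
    best_vec, best_val = None, -np.inf
    for a in range(pomdp.n_actions):
        g_b_a = pomdp.R[a].copy()                                  # r_a
        for o in range(pomdp.n_obs):
            # g_{alpha,a,o}(s) = sum_s' O(a,s',o) tr(s,a,s') alpha(s'); one column per alpha
            g_ao = pomdp.tr[a] @ (pomdp.O[a, :, o][:, None] * alpha_vectors.T)
            g_b_a += pomdp.gamma * g_ao[:, np.argmax(b @ g_ao)]    # argmax_alpha g_{alpha,a,o} . b
        if g_b_a @ b > best_val:
            best_vec, best_val = g_b_a, g_b_a @ b                  # keep the best g_{b,a}
    return best_vec                                                # = argmax_a g_{b,a} . b
```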

17 Point-based Solvers Compute a value function V over a subset B of the belief space. –Usually only reachable belief points are used. Use α-vectors to represent V. Assumption: an optimal value function over B will generalize well to other, unobserved belief points. Advantage – each vector must maximize some b in B. Dominated vectors are pruned implicitly.

18 Variations of Point-Based Solvers A number of algorithms have been suggested: –PBVI (Pineau et al. 2001) –Perseus (Spaan and Vlassis 2003) –HSVI (Smith and Simmons 2005) –PVI (Shani et al. 2006) –FSVI (Shani et al. 2007) –SCVI (Virin et al. 2007) Differences between algorithms: –Selection of B – fixed/expanding set, traversal/distance. –Computation of V – which points are updated, and in what order backups are executed.

19 Belief Set Selection Option 1 – expanding belief set –PBVI [Pineau et al. 2001] B_0 = {b_0} – the initial belief state B_{n+1} – for each b in B_n, add an immediate successor b' = τ(b,a,o) such that dist(B_n, b') is maximal. Assumption – in the limit, B will include all reachable belief states, and therefore V will converge to the optimal value function.
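
One plausible reading of this expansion step in code, as a sketch: sample one successor per action and keep the candidate farthest from the current set (the L1 distance and the sampled successors are assumptions about the details; `rng` would be, e.g., `np.random.default_rng()`, and `tau` is the belief update sketched earlier).

```python
import numpy as np

def expand_belief_set(pomdp, B, tau, rng):
    """PBVI-style expansion: for each b in B, add its farthest immediate successor (sketch)."""
    new_points = []
    for b in B:
        candidates = []
        for a in range(pomdp.n_actions):
            s = rng.choice(pomdp.n_states, p=b)                    # sample a state from b
            s_next = rng.choice(pomdp.n_states, p=pomdp.tr[a, s])  # s' ~ tr(s, a, .)
            o = rng.choice(pomdp.n_obs, p=pomdp.O[a, s_next])      # o  ~ O(a, s', .)
            candidates.append(tau(pomdp, b, a, o))                 # b' = tau(b, a, o)
        current = B + new_points
        farthest = max(candidates,
                       key=lambda bp: min(np.abs(bp - x).sum() for x in current))
        new_points.append(farthest)
    return B + new_points
```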

20 [Illustration: expanding the belief set B – candidate successor beliefs of b_0 on the way to the Goal.]

21 Belief Set Selection Option 2 – random walk –Perseus [Spaan & Vlassis 2004] Run a number of trials beginning at b_0, each of length n: for i = 0 to n –a_i = random action –o_i = random observation –b_{i+1} = τ(b_i, a_i, o_i) B is the set of all observed belief states. Assumption – a sufficiently long exploration visits all "important" belief points. Disadvantage – may add many "irrelevant" belief points.
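
A sketch of this collection phase, assuming the containers and the `tau` update from earlier; sampling the observation from the model's observation distribution is one interpretation of "random observation".

```python
import numpy as np

def collect_beliefs_random_walk(pomdp, tau, n, rng):
    """Perseus-style belief collection: one random walk of length n starting at b0 (sketch)."""
    B = [pomdp.b0]
    b = pomdp.b0
    s = rng.choice(pomdp.n_states, p=pomdp.b0)            # a sampled "true" state for simulation
    for _ in range(n):
        a = rng.integers(pomdp.n_actions)                 # a_i = random action
        s = rng.choice(pomdp.n_states, p=pomdp.tr[a, s])  # environment moves to s'
        o = rng.choice(pomdp.n_obs, p=pomdp.O[a, s])      # o_i sampled from O(a, s', .)
        b = tau(pomdp, b, a, o)                           # b_{i+1} = tau(b_i, a_i, o_i)
        B.append(b)
    return B
```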

22 [Illustration: belief points B collected by a random walk toward the Goal.]

23 Belief Set Selection Option 3 – heuristic exploration Run a number of trials beginning at b_0: while the stopping criterion has not been reached –a_i = choose action –o_i = choose observation –b_{i+1} = τ(b_i, a_i, o_i) –i++ HSVI [Smith & Simmons 2005]: –Maintains a lower bound and an upper bound over V*. –Chooses the best a according to the upper bound. –Chooses o such that b_{i+1} has the largest gap between the bounds.
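
The trial structure might be sketched as below; `lower` and `upper` stand for hypothetical bound objects (the value, Q-value, and backup methods are assumptions), since maintaining the upper bound is the intricate part of HSVI and is not shown here.

```python
def hsvi_trial(pomdp, tau, lower, upper, b, depth, epsilon):
    """One HSVI-style exploration trial followed by backups in reverse order (sketch)."""
    # stop when the gap between the bounds at b is small enough at this depth
    if upper.value(b) - lower.value(b) <= epsilon / (pomdp.gamma ** depth):
        return
    # choose the action that looks best under the optimistic upper bound
    a = max(range(pomdp.n_actions), key=lambda act: upper.Q(b, act))
    # choose the observation whose successor belief has the largest bound gap
    o = max(range(pomdp.n_obs),
            key=lambda obs: upper.value(tau(pomdp, b, a, obs)) - lower.value(tau(pomdp, b, a, obs)))
    hsvi_trial(pomdp, tau, lower, upper, tau(pomdp, b, a, o), depth + 1, epsilon)
    lower.backup(b)   # point-based backup of the lower bound at b, on the way back
    upper.backup(b)   # corresponding update of the upper bound at b
```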

24 Forward Search Value Iteration [Shani et al. 2007] A POMDP agent cannot directly obtain the environment state. In simulation we may assume that the environment state is available. Idea – use the simulated environment state to guide exploration in belief space.

25 Forward Search Value Iteration [Shani, Brafman, Shimony 2007] Each trial traverses the MDP state space (s_0, s_1, s_2, ...) and the POMDP belief space (b_0, b_1, b_2, ...) in parallel: –a_i* ← best action for s_i –s_{i+1} ← choose from tr(s_i, a_i*, ·) –o_i ← choose from O(a_i*, s_{i+1}, ·) –b_{i+1} ← τ(b_i, a_i*, o_i)
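
In code, one FSVI trial could look like the following sketch; `mdp_policy` is assumed to be an optimal policy of the underlying MDP (e.g. obtained with the value-iteration sketch earlier), and the visited beliefs would afterwards be backed up in reverse order.

```python
import numpy as np

def fsvi_trial(pomdp, tau, mdp_policy, horizon, rng):
    """One FSVI-style traversal: the simulated MDP state guides the belief-space walk (sketch)."""
    b = pomdp.b0
    s = rng.choice(pomdp.n_states, p=pomdp.b0)            # simulated environment state
    visited = [b]
    for _ in range(horizon):
        a = mdp_policy[s]                                 # a_i* = best MDP action for s_i
        s = rng.choice(pomdp.n_states, p=pomdp.tr[a, s])  # s_{i+1} ~ tr(s_i, a_i*, .)
        o = rng.choice(pomdp.n_obs, p=pomdp.O[a, s])      # o_i    ~ O(a_i*, s_{i+1}, .)
        b = tau(pomdp, b, a, o)                           # b_{i+1} = tau(b_i, a_i*, o_i)
        visited.append(b)
    return visited                                        # back these up in reverse order
```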

26 Asynchronous Value Iteration Synchronous value iteration – improve all points in each iteration. Asynchronous value iteration – improve some points more than others. Assumption: some points are more important to the value function than others. PBVI and Perseus are synchronous; HSVI is asynchronous.

27 Dual vs. Single Value Function The standard Bellman update creates V_{n+1} based on V_n – it uses two value functions: –V_{n+1}(b) = max_a [r_a·b + γ ∑_o pr(o|b,a) V_n(τ(b,a,o))] Disadvantage – improvements are not used until the next iteration. The update is faster when only a single value function is improved in place, so each backup benefits from previous backups: –V(b) = max_a [r_a·b + γ ∑_o pr(o|b,a) V(τ(b,a,o))]

28 Dual vs. Single Value Function Dual value function: while V_n ≠ V_{n+1} – for each b in B, add backup(b, V_n) to V_{n+1}; n = n + 1. Single value function: while V still changes – for each b in B, add backup(b, V) to V. With a dual value function, improvements are not used until the next iteration (n = n + 1); with a single value function, each backup benefits from all previous backups.
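
The single-value-function variant can be sketched as follows: each new α-vector is appended to the same set V, so backups later in the sweep already see it (the improvement test, tolerance, and sweep limit are illustrative; `backup` would be, e.g., the point-based backup sketched earlier).

```python
import numpy as np

def improve_single_value_function(pomdp, B, initial_vectors, backup, max_sweeps=100, tol=1e-6):
    """Sweep over B, adding each backup to the one and only value function V (sketch)."""
    V = [np.asarray(v) for v in initial_vectors]
    for _ in range(max_sweeps):
        changed = False
        for b in B:
            alpha = backup(pomdp, np.array(V), b)          # backup against the *current* V
            if alpha @ b > max(v @ b for v in V) + tol:    # did the backup improve V(b)?
                V.append(alpha)                            # later backups see this immediately
                changed = True
        if not changed:                                    # V no longer changes
            break
    return V
```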

29 Value Function Update PBVI – an α-vector for each belief point in B; arbitrary order of updates. Perseus – randomly select the next point to update from the points that were not yet improved. HSVI & FSVI – over each belief-state traversal, execute backups in reverse order.

30 PBVI [Illustration: α-vectors over belief points b_0 ... b_6.] Many vectors may not participate in the upper envelope. All points are updated before any point can be updated twice (synchronous update). It is possible for a successor of a point to be updated after that point, causing slow propagation of values.

31 HSVI & FSVI Advantage – backups exploit previous backups on successors. [Illustration: backups executed in reverse order along the traversal b_0 ... b_6.]

32 Perseus [Illustration: belief points b_0 ... b_6.] Advantages: –A small number of vectors in each iteration. –All points are improved, but not all are updated. Disadvantage: –May choose points that are only slightly improved and skip points that could be highly improved.

33 Backup Selection Perseus generates good value functions. Can we accelerate the convergence of V? Idea – choose the points to back up smartly, so that the value function improves considerably after each backup.

34 Prioritizing Backups Update a point b where the Bellman error e(b) = HV(b) - V(b) is maximal. Well known for MDPs. Problem – unlike in MDPs, after improving b it is difficult to update the error of all other points: –The list of predecessors of b cannot be computed. –A new α-vector may improve the value of more than a single belief point. Solution – recompute the error over a sampled subset of B and select the point with maximal error from that set.
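
A sketch of this selection rule, assuming the backup and α-vector conventions used earlier; HV(b) is computed here by performing a trial backup at b and evaluating the resulting vector.

```python
import numpy as np

def select_max_bellman_error(pomdp, B, alpha_vectors, backup, sample_size, rng):
    """Pick the belief with the largest e(b) = HV(b) - V(b) among a sampled subset of B (sketch)."""
    idx = rng.choice(len(B), size=min(sample_size, len(B)), replace=False)
    def bellman_error(b):
        V_b = np.max(alpha_vectors @ b)                # current value V(b)
        HV_b = backup(pomdp, alpha_vectors, b) @ b     # value after one backup, HV(b)
        return HV_b - V_b
    return max((B[i] for i in idx), key=bellman_error)
```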

35 PVI [Shani, Brafman, Shimony 2006] Advantages: –Every backup results in a value function improvement. –Backups are locally optimal. Disadvantage: –HV(B) computations are expensive. [Illustration: HV(B) evaluated over belief points b_0 ... b_6.]

36 Prioritizing Existing Algorithms PBVI – when updating V, iterate over the belief points in decreasing Bellman-error order; expand B as before. Perseus – instead of selecting points to update at random, select the point with maximal Bellman error. HSVI – not applicable; HSVI's reverse-order updates can already be viewed as a prioritization technique.

37 Prioritized Value Iteration Select B using a Q_MDP-guided heuristic traversal. Use a single value function. Update the value function by backing up the point with maximal Bellman error. Stop when all errors drop below ε.

38 Clustered Value Iteration [Virin, Shani, Shimony, Brafman, 2007] Compute a clustering of the belief space. Iterate over the clusters, and back up only belief points from the current cluster. Clusters are built such that a point is usually updated after its successors.

39 Value Directed Clustering Compute the MDP optimal value function. Cluster the MDP states by their MDP value. Define a soft clustering over the belief space: pr(b ∈ c) = ∑_{s∈c} b(s). Iterate over the clusters by decreasing cluster value, V(c) = (1/|c|) ∑_{s∈c} V(s). Update all belief points for which pr(b ∈ c) exceeds a threshold.
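
A sketch of the soft membership and the cluster ordering; `clusters` is assumed to be a list of arrays of MDP state indices grouped by similar MDP value, and `V_mdp` the MDP value function (e.g. from the value-iteration sketch earlier).

```python
import numpy as np

def cluster_membership(b, clusters):
    """pr(b in c) = sum of b's probability mass on the states assigned to cluster c (sketch)."""
    return np.array([b[c].sum() for c in clusters])

def clusters_by_decreasing_value(clusters, V_mdp):
    """Order clusters by the average MDP value of their member states, highest first."""
    return sorted(clusters, key=lambda c: -V_mdp[c].mean())
```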

40 Results – Example Domains [Table: CPU time, number of backups, and value function size |V| for Perseus, HSVI, PVI, and SCVI on the Hallway2 and Rock Sample 5,5 domains.]

41 Experimental Results CPU Time: HSVI vs. SCVI

42 Experimental Results CPU Time: HSVI vs. FSVI

43 Results – Backups vs. ADR (average discounted reward)

44 Summary Point-based solvers are able to scale up to POMDPs with millions of states. Algorithms differ in the selection of belief points and in the order of backups. A smart order of backups can be computed using prioritization and clustering. Trial-based algorithms are an alternative; FSVI is the fastest algorithm of this family.

