Planning and Learning with Hidden State


1 Planning and Learning with Hidden State
Michael L. Littman, AT&T Labs – Research

2 Outline
The problem of hidden state; POMDPs; planning; learning; predictive state representations.

3 Acting Intelligently…
What was that?

4 …Means Asking for Directions (at least sometimes)
How do you build an agent that can:
take actions to gain information,
reason about what it doesn't know,
represent things it can't see, and
remember what's relevant for decisions?
Planning and acting in complex environments; explicitly reasoning about information.

5 Applications
Distributed systems: what caused that fault?
Agent negotiation, e-commerce: how much does she want that PlayStation?
Graphics, animation, life-like characters: where should the character be looking?
Natural language, dialog: what does the user really want?

6 Navigation Example (Littman & Simpson 98)
Actions: right-hand rule, left-hand rule, stop.
Goal: stop at the star.
Observations: convex, concave, bottleneck, win!

7 Environments: Formal Model
The environment maps an action-observation history and the current action to a probability distribution over the next observation: Pr(o|h,a).

8 One Model: POMDPs
Partially observable Markov decision processes:
finite set of states and actions (i ∈ S, a ∈ A)
transition probability from state i to state j under action a: T^a_ij
finite set of observations (o ∈ O)
observation probability in state i: O^o_ii
reward function: r(o)
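A minimal sketch of this tabular representation in Python/NumPy, used by the later sketches; the container name and layout are my own, not from the talk:

import numpy as np

class POMDP:
    """Tabular POMDP: T[a] is an |S| x |S| matrix with T[a][i, j] = Pr(j | i, a);
    O[o] is a length-|S| vector holding the diagonal of O^o, i.e. O[o][i] = Pr(o | i);
    r[o] is the reward for observing o."""
    def __init__(self, T, O, r, gamma=0.95):
        self.T = {a: np.asarray(m, dtype=float) for a, m in T.items()}
        self.O = {o: np.asarray(v, dtype=float) for o, v in O.items()}
        self.r = dict(r)
        self.gamma = gamma
        self.n_states = next(iter(self.T.values())).shape[0]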

9 POMDP: Belief States
Example history h = left, convex, left, convex, left, concave.
The belief state b(h), a 1×|S| row vector, summarizes the history.

10 Belief State is “State” (Åström 65)
Can represent the environment: Pr(o|h,a) = b(h) T^a O^o e^T.
As new information a, o arrives, we can update: b(hao) = b(h) T^a O^o / Pr(o|h,a).
The belief state is all we need to remember.
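As a sketch, the belief update and Pr(o|h,a) in NumPy, assuming the POMDP container above:

def belief_update(b, a, o, pomdp):
    # b(hao) = b(h) T^a O^o / Pr(o|h,a), where Pr(o|h,a) = b(h) T^a O^o e^T
    unnorm = (b @ pomdp.T[a]) * pomdp.O[o]
    pr_o = unnorm.sum()
    return unnorm / pr_o, pr_o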

11 Planning in POMDPs
Dynamic programming: the value function maps each state to expected future value.
Choose actions to maximize the sum of immediate reward and the expected value of the resulting state.
In POMDPs, the belief state plays the role of state.
[Figure: value function, the maximum total expected reward from belief b.]

12 POMDP Value Functions
The finite-horizon value function has a finite representation: it is piecewise-linear and convex (Sondik 71).
[Figure: value vectors for actions a and b plotted against Pr(i2); animation by Thrun.]
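Such a piecewise-linear convex value function can be stored as a set of vectors (one per conditional plan) and evaluated by a max over dot products; a minimal sketch, with names of my own:

import numpy as np

def value(b, Gamma):
    # V(b) = max over vectors alpha in Gamma of b . alpha
    return max(float(np.dot(b, alpha)) for alpha in Gamma)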

13 Functional DP
V(b) = max_a Σ_o Pr(o|a,b) γ (r(o) + V'(b'))
Represent V'(b) by a set of vectors.
Cross-sum: A ⊕ B = { α + β | α ∈ A, β ∈ B }; the max of two such functions corresponds to the union of their vector sets (sketched below).
Also applies to: time-based MDPs (Boyan & Littman 01), Nash equilibria (Kearns, Littman & Singh 01), knapsack (Csirik, Littman, Singh & Stone 01).
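A sketch of the cross-sum, the set operation that implements addition of the vector-set representations (taking a max of two PWLC functions is just the union of their vector sets):

def cross_sum(A, B):
    # A (+) B = { alpha + beta | alpha in A, beta in B }
    return [alpha + beta for alpha in A for beta in B]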

14 Incremental Pruning
Elegant, simple, fast (Cassandra, Littman & Zhang 97).
Start with the vectors from the previous iteration, C_{t-1}.
For each a and o, compute C_t^{a,o} = { e r(o)/|O| + γ T^a O^o α^T | α ∈ C_{t-1} }.
For each a, compute C_t^a = purge(C_t^{a,o_k} ⊕ … ⊕ purge(C_t^{a,o_2} ⊕ C_t^{a,o_1})).
Let C_t = purge(∪_a C_t^a).
“purge” is done via linear programming (Lark; White 91).
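A sketch of one Incremental Pruning stage built on the POMDP container and cross_sum above. For brevity, purge here only removes pointwise-dominated vectors; the algorithm as published purges with linear programs, which also removes vectors dominated by combinations of others.

from functools import reduce
import numpy as np

def purge(vectors):
    # Keep vectors not pointwise-dominated by another vector (simplified purge).
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(j != i and np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept

def backup_ao(C_prev, pomdp, a, o):
    # C_t^{a,o} = { e r(o)/|O| + gamma T^a O^o alpha | alpha in C_{t-1} }
    const = pomdp.r[o] / len(pomdp.O) * np.ones(pomdp.n_states)
    return [const + pomdp.gamma * (pomdp.T[a] @ (pomdp.O[o] * alpha))
            for alpha in C_prev]

def incremental_pruning_step(C_prev, pomdp):
    C_t = []
    for a in pomdp.T:                                     # C_t^a via interleaved purging
        parts = [purge(backup_ao(C_prev, pomdp, a, o)) for o in pomdp.O]
        C_t += reduce(lambda A, B: purge(cross_sum(A, B)), parts)
    return purge(C_t)                                     # C_t = purge(U_a C_t^a)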

15 Algorithmic Complexity
Incremental Pruning is polynomial in Σ_a |C_t^a|, |O|, |S|, |A|, and the number of bits of precision.
Witness (Kaelbling, Littman & Cassandra 98) was the first algorithm with this guarantee.
We would prefer a bound in terms of |C_t| rather than Σ_a |C_t^a|, but that is impossible unless NP = RP (Littman & Cassandra 96).
Empirical results surpass prior algorithms.

16 Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
States: SR, ST, SM, done; transitions go to done.
Actions: QC (What city?); QR, QT, QM (Going to X?); R, T, M (I think X).
Observations: yes, no (more reliable); R, T, M (T and M confusable).
Objective: reward for the correct classification, cost for asking questions.

17 Incremental Pruning Output
The optimal plan varies with the priors (shown with Pr(SR) = Pr(SM)). [Figure: policy regions over the belief simplex with corners SR, ST, SM.]

18–21 [Figure-only slides (no transcript text).]

22 Sometimes it is best not to ask for directions.

23 Other Approaches
Specialized algorithms for deterministic environments (Littman 96).
Finding good memoryless policies (Littman 94).
Approximation via RL (Littman, Cassandra & Kaelbling 95).
Structured state spaces: expressive equivalence (Littman 97); complexity (Littman, Goldsmith & Mundhenk 98).
Planning via stochastic satisfiability (Majercik & Littman 97, 99).

24 Learning a Model
The agent tries to learn to predict how the environment will react: it builds a model. The model is then used to plan: “indirect control.”
experience → learner → model
model + experience → tracker → state
model + state → planner → action decisions

25 Learning a POMDP
Input: a history (action-observation sequence).
Output: a POMDP that “explains” the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
E-step: forward-backward over the POMDP model gives state-occupation probabilities.
M-step: fractional counting.
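A sketch of the E-step (unscaled forward-backward over one history) using the POMDP container above; real implementations rescale to avoid underflow, and the M-step then re-estimates T and O by fractionally counting transitions and observations under these posteriors.

import numpy as np

def e_step(pomdp, actions, observations, b0):
    # Returns post[t, i] = Pr(state at step t is i | whole history).
    n, S = len(actions), pomdp.n_states
    alpha = np.zeros((n, S))              # forward: joint of prefix and state
    beta = np.ones((n, S))                # backward: likelihood of suffix
    prev = np.asarray(b0, dtype=float)
    for t, (a, o) in enumerate(zip(actions, observations)):
        alpha[t] = (prev @ pomdp.T[a]) * pomdp.O[o]
        prev = alpha[t]
    for t in range(n - 2, -1, -1):
        a, o = actions[t + 1], observations[t + 1]
        beta[t] = pomdp.T[a] @ (pomdp.O[o] * beta[t + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)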

26 EM Pitfalls
Each iteration increases the data likelihood, but there are local maxima (Shatkay & Kaelbling 97; Nikovski 99), and EM rarely learns a good model.
Hidden states are truly unobservable. Can we ground a model in data?

27 History Window
Base predictions on the most recent k actions and observations.
Pros: easy to update (scroll the window); easy to learn (keep counts).
Cons: the “horizon effect” (only an approximation of the environment); the state space grows exponentially with k.
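A small sketch of a k-step history-window predictor with counts (class and method names are my own):

from collections import defaultdict

class HistoryWindow:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))  # (window, action) -> obs -> count

    def update(self, history, action, next_obs):
        window = tuple(history[-self.k:])          # "scroll the window"
        self.counts[(window, action)][next_obs] += 1

    def predict(self, history, action):
        c = self.counts[(tuple(history[-self.k:]), action)]
        total = sum(c.values())
        return {o: n / total for o, n in c.items()} if total else {}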

28 Best of Both?
Is it possible to have a representation that is grounded in actions and observations and as expressive as POMDPs?

29 Predictions as State
Idea: keep the key information from the distant past, but never look too far into the future.
[Grid-world example: start at blue; down → red; (left → red) an odd number of times; up → __? The history says what to forget; predictions such as “up → blue?” and “left → red?” say what to expect.]

30 What’s a Test?
Test: t = a_1 o_1 a_2 o_2 … a_l o_l (closed loop).
Test outcome prediction: the probability of the observations given the actions, Pr(o_1 o_2 … o_l | h, a_1 a_2 … a_l) = Pr(t|h).
Prediction vector (1×|Q|) for a set Q of tests: Pr(Q|h) = (Pr(t_1|h), Pr(t_2|h), …, Pr(t_|Q||h)), abbreviated p(h).

31 PSR Definition (Littman, Sutton & Singh 02)
Predictive state representation (PSR): a set Q of tests whose predictions are a sufficient statistic: Pr(t|h) = f_t(Pr(Q|h)) = f_t(p(h)).
Any test's outcome can be found from the prediction vector.
Linear PSR (m_t is 1×|Q|): Pr(t|h) = f_t(p(h)) = p(h) m_t^T.

32 Recursive Updating
As new information a, o arrives, we can update, for each t in Q:
Pr(t | h a o) = Pr(o t | h, a) / Pr(o | h, a) = f_{aot}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot}^T / p(h) m_{ao}^T   (linear PSR)
Matrix form: p(hao) = p(h) M_{ao}^T / p(h) m_{ao}^T.
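The matrix-form update in NumPy; a sketch where M[(a, o)] is assumed to be the |Q|×|Q| matrix whose row for test t is m_aot, and m[(a, o)] the vector for the one-step test a o:

import numpy as np

def psr_update(p, a, o, M, m):
    # p(hao) = p(h) M_ao^T / p(h) m_ao^T;  the denominator is Pr(o | h, a)
    pr_o = float(p @ m[(a, o)])
    return (p @ M[(a, o)].T) / pr_o, pr_o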

33 Connection to POMDPs
POMDPs and PSRs are quite similar; let’s connect them. We will show that every POMDP has a PSR.
Outcome u(t): the prediction of t from each state (an |S|-vector).
A test t is independent of a set of tests Q if u(t) is linearly independent of the vectors u(Q).

34 Float/Reset POMDP
Float (F): random walk. Reset (R): go to the far-right state; observe 1 if already there.
[Table: outcome vectors u(R 1), u(F 0), u(F 0 R 1), u(R 0 R 1); values omitted in the transcript.]

35 Linear PSR from POMDP
Given a POMDP representation, how can we pick tests that form a sufficient statistic? The search procedure:
Start with Q = {}.
While some one-step test a o, or some extension a o t of a test t in Q, has an outcome vector linearly independent of u(Q), add it to Q.
Otherwise terminate and return Q.
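A sketch of search on top of the earlier POMDP container. It uses the recursion u(a o t) = T^a O^o u(t), starting from the null test (whose outcome vector is all ones), and adds an extension whenever its outcome vector raises the rank of u(Q):

import numpy as np

def find_core_tests(pomdp):
    S = pomdp.n_states
    Q, U = [], []                                   # tests and their outcome vectors
    frontier = [((), np.ones(S))]                   # null test: u = all ones
    while frontier:
        t, u_t = frontier.pop()
        for a in pomdp.T:
            for o in pomdp.O:
                u_new = pomdp.T[a] @ (pomdp.O[o] * u_t)      # u(a o t)
                if np.linalg.matrix_rank(np.vstack(U + [u_new])) > len(Q):
                    Q.append(((a, o),) + t)                  # prepend a o to test t
                    U.append(u_new)
                    frontier.append((((a, o),) + t, u_new))
    return Q, np.array(U)                           # len(Q) <= |S| by construction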

36 Properties of search
search terminates: Q contains no more than |S| tests, and no test in Q is longer than |S|.
All tests are dependent on Q.
M_{ao}^T = u(Q)^+ T^a O^o u(Q) (the u(Q) factors cancel in the update).
Runs in polynomial time and captures the POMDP.

37 Float/Reset Updates
[Table: how the tests in Q update after taking F and observing 0.]
Interestingly, the two tests R 1 and F 0 R 1 form a (non-linear) PSR for this domain.

38 Planning in PSRs
V(p) = max_a Σ_o Pr(o|a,p) γ (r(o) + V'(p'))
Let V'(p) = max_α p α^T (piecewise linear and convex).
Recall p' = p M_{ao}^T / p m_{ao}^T (linear PSR).
So V(p) = max_a max_α p (Σ_o γ (m_{ao}^T r(o) + M_{ao}^T α^T)) = max_β p β^T.
Incremental pruning works; sampling or function approximation do, too.

39 Learning in a Linear PSR
Need to learn m_{aot} for each test t in Q. Use the current estimates to produce a sequence of prediction vectors p.
Example data: … F 0 F 0 R 0 R 1 F 0 R 1 F 0 F 0 …
m_{R1 F0 R0} ← 0 (IS-weighted); m_{R1 F0} ← 1 (IS-weighted); m_{R1 R0} ← no target.
Simple via a linear update rule (the delta rule): gradient-like, but no improvement theorem is known.
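A minimal sketch of the delta-rule update for one parameter vector m_aot, with a learning-rate and importance-weight argument of my own (the talk does not specify them):

import numpy as np

def delta_update(m_aot, p, outcome, is_weight=1.0, lr=0.1):
    # Move the prediction p . m_aot toward the (importance-weighted) 0/1 target.
    error = outcome - float(p @ m_aot)
    return m_aot + lr * is_weight * error * p

An update is applied only when the data actually execute the extension test from the current history; otherwise there is no target, as in the m_{R1 R0} case above.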

40 Contributions
Formulated important problems as POMDPs; planned via exact and approximate methods; solutions factor in the cost of information.
Predictions are useful as state:
Rivest & Schapire (1987): deterministic environments.
Jaeger (1999): stochastic, no actions.
This work: stochastic environments with actions.
As expressive as POMDPs with belief states, even with linear predictions.

41 Ongoing Work: PSRs
Learning PSRs (with Singh, Stone, Sutton): does it work? Can TD help? How should tests be chosen?
Connection between tests and options.
Abstraction via dropping tests.
Key tests useful for state and reward.

42 Learning in Float-Reset
[Figure-only slide.]

43 Value Iteration for POMDPs
Find a set of vectors (linear functions) to represent the value function: for t = 1 to k, find a minimum covering C_t using the vectors in C_{t-1}.
One-pass (Sondik 71): search constrained regions.
Exhaustive (Monahan 82; Lark 91): enumerate vectors.
Linear support (Cheng 88): enumerate vertices.
Witness (Littman, Cassandra & Kaelbling 95): cover by actions.
Incremental Pruning (Cassandra, Littman & Zhang 97): enumerate, with periodic removal of redundant vectors.

