Planning and Learning with Hidden State
Michael L. Littman, AT&T Labs-Research
Outline
The problem of hidden state
POMDPs: planning
POMDPs: learning
Predictive state representations
9/22/2018 Planning and Learning with Hidden State
Acting Intelligently… What was that?
…Means Asking for Directions (at least sometimes)
How do you build an agent that can…
take actions to gain information
reason about what it doesn't know
represent things it can't see
remember what's relevant for decisions
Planning & acting in complex environments
Explicitly reasoning about information
Applications
Distributed systems: what caused that fault?
Agent negotiation, e-commerce: how much does she want that PlayStation?
Graphics, animation, life-like characters: where should the character be looking?
Natural language, dialog: what does the user really want?
Navigation Example (Littman & Simpson 98)
Actions: right-hand rule, left-hand rule, stop
Goal: stop at star
Observations: convex, concave, bottleneck, win!
Environments: Formal Model
The environment maps an action-observation history h and the current action a to a probability distribution over the next observation: Pr(o | h, a).
One Model: POMDPs
Partially observable Markov decision processes:
finite sets of states and actions (i, j ∈ S, a ∈ A)
transition probability from state i to state j under action a: T^a_ij
finite set of observations (o ∈ O)
probability of observation o in state i: O^o_i
reward function: r(o)
POMDP: Belief States
h = left convex left convex left concave
b(h), a 1 × |S| vector, summarizes the history.
Belief State is “State” (Aström 65)
Can represent the environment: Pr(o | h, a) = b(h) T^a O^o e^T
As new information a o arrives, can update:
b(h a o) = b(h) T^a O^o / Pr(o | h, a)
The belief state is all we need to remember.
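The belief update above can be sketched in a few lines (a minimal sketch; `belief_update` and the toy two-state matrices are hypothetical illustration, not from the talk):

```python
import numpy as np

def belief_update(b, T_a, O_o):
    """One step of b(hao) = b(h) T^a O^o / Pr(o | h, a)."""
    unnorm = b @ T_a @ O_o      # row vector of Pr(next state and o)
    pr_o = unnorm.sum()         # Pr(o | h, a), the normalizer
    return unnorm / pr_o, pr_o

# Toy two-state example (hypothetical numbers).
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])    # Pr(j | i, a)
O_o = np.diag([0.7, 0.1])       # diagonal matrix of Pr(o | j)
b = np.array([0.5, 0.5])
b_next, pr_o = belief_update(b, T_a, O_o)
```

Note the normalizer Pr(o | h, a) falls out of the same computation, which is exactly what recursive tracking needs.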
Planning in POMDPs
Dynamic programming: the value function maps state to expected future value. Choose actions to maximize the sum of reward and the expected value of the resulting state. In POMDPs, the belief state must serve as the state.
[Figure: value function, the maximum total expected reward, plotted over beliefs b]
POMDP Value Functions
[Figure: linear value functions for committing to action a or action b, plotted over Pr(i2); animation by Thrun]
The finite-horizon value function has a finite representation: it is piecewise-linear and convex (Sondik 71).
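Because the value function is piecewise-linear and convex, evaluating it is just a max over dot products with a set of vectors (a sketch; the alpha-vector entries below are made up to echo the figure's numbers):

```python
import numpy as np

def V(b, alphas):
    """Piecewise-linear, convex value function: the upper surface of the
    linear functions b -> b . alpha, one alpha-vector per plan."""
    return max(float(b @ alpha) for alpha in alphas)

# Hypothetical alpha-vectors over two states (i1, i2).
alpha_a = np.array([100.0, -40.0])   # value of committing to action a
alpha_b = np.array([-100.0, 100.0])  # value of committing to action b
v_mid = V(np.array([0.5, 0.5]), [alpha_a, alpha_b])
```

The maximizing vector at a given belief also identifies the best action there, which is how a vector set doubles as a policy.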
Functional DP
V(b) = max_a Σ_o Pr(o | a, b) γ (r(o) + V'(b'))
Represent V'(b) by a set of vectors; the cross-sum A + B = { α + β | α ∈ A, β ∈ B } implements addition, and set union implements max.
The same machinery also applies to:
time-based MDPs (Boyan & Littman 01)
Nash equilibria (Kearns, Littman & Singh 01)
knapsack problems (Csirik, Littman, Singh & Stone 01)
Incremental Pruning
Elegant, simple, fast (Cassandra, Littman & Zhang 97).
Start with the vectors C_{t-1} from the previous iteration.
For each a and o, compute C_t^{a,o} = { e^T r(o)/k + γ T^a O^o α^T | α ∈ C_{t-1} }.
For each a, compute C_t^a = purge(C_t^{a,o_k} ⊕ … ⊕ purge(C_t^{a,o_2} ⊕ C_t^{a,o_1})), where ⊕ is the cross-sum.
Let C_t = purge(∪_a C_t^a).
“purge” removes dominated vectors via linear programming (Lark; White 91).
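The cross-sum and purge primitives can be sketched as follows. This is a simplification: the `purge` here drops only pointwise-dominated vectors, whereas the LP-based purge of Lark/White also removes vectors that never achieve the max at any belief:

```python
import numpy as np

def cross_sum(A, B):
    """A + B = { a + b | a in A, b in B }: the vector set whose upper
    surface is the sum of the two upper surfaces."""
    return [a + b for a in A for b in B]

def purge(vectors):
    """Keep only vectors not pointwise-dominated by another vector.
    (Simplification of the LP-based purge described on the slide.)"""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            j != i and np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors)
        )
        if not dominated:
            kept.append(v)
    return kept

# Example: the all-zeros vector is dominated everywhere and gets purged.
kept = purge([np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.0, 0.0])])
summed = cross_sum([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                   [np.array([0.0, 1.0])])
```

Interleaving purge with each cross-sum, as in the slide's nested expression, is what keeps the intermediate sets small.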
Algorithmic Complexity
Incremental Pruning is polynomial in Σ_a |C_t^a|, |O|, |S|, |A|, and the number of bits of precision.
Witness (Kaelbling, Littman & Cassandra 98) came first.
We would prefer polynomial in |C_t| rather than Σ_a |C_t^a|, but that is impossible unless NP = RP (Littman & Cassandra 96).
Empirical results surpass prior algorithms.
Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
States: S_R, S_T, S_M, done. Classification actions transition to done.
Actions: QC (“What city?”); QR, QT, QM (“Going to X?”); R, T, M (“I think X”).
Observations: yes, no (more reliable); R, T, M (T and M confusable).
Objective: reward for correct classification, cost for questions.
Incremental Pruning Output
[Figure: policy regions over the belief simplex with corners S_R, S_T, S_M]
The optimal plan varies with the priors (shown along S_R = S_M).
Sometimes best to not ask for directions.
Other Approaches
Specialized algorithms for deterministic POMDPs (Littman 96)
Finding good memoryless policies (Littman 94)
Approximation via RL (Littman, Cassandra & Kaelbling 95)
Structured state spaces: expressive equivalence (Littman 97); complexity (Littman, Goldsmith & Mundhenk 98)
Planning via stochastic satisfiability (Majercik & Littman 97, 99)
Learning a Model
The agent tries to learn to predict how the environment will react: it builds a model. The model is then used to plan: “indirect control.”
experience → learner → model
model + experience → tracker → state
model + state → planner → action decisions
Learning a POMDP
Input: a history (action-observation sequence).
Output: a POMDP that “explains” the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
E step: forward-backward on the POMDP model gives state-occupation probabilities.
M step: fractional counting re-estimates the parameters.
EM Pitfalls
Each iteration increases the data likelihood, but EM gets stuck in local maxima (Shatkay & Kaelbling 97; Nikovski 99) and rarely learns a good model.
Hidden states are truly unobservable. Can we ground a model in data?
History Window
Base predictions on the most recent k actions and observations.
Pros: easy to update (scroll the window); easy to learn (keep counts).
Cons: the “horizon effect,” since the window only approximates the environment; the state space grows exponentially with k.
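A k-window predictor is easy to sketch. The class name `HistoryWindow` and the toy stream are hypothetical; counts are indexed by the last k action-observation pairs plus the current action:

```python
from collections import defaultdict

class HistoryWindow:
    """Predict the next observation from counts indexed by the last k
    (action, observation) pairs plus the chosen action."""

    def __init__(self, k):
        self.k = k
        # One count table per window key: exponential growth in k.
        self.counts = defaultdict(lambda: defaultdict(int))
        self.window = []

    def update(self, action, obs):
        key = tuple(self.window[-self.k:]) + (action,)
        self.counts[key][obs] += 1
        self.window.append((action, obs))  # scroll the window

    def predict(self, action):
        """Empirical Pr(o | last k pairs, action); {} if never seen."""
        key = tuple(self.window[-self.k:]) + (action,)
        c = self.counts[key]
        total = sum(c.values())
        return {o: n / total for o, n in c.items()} if total else {}

# Toy alternating stream: after seeing ('a', 'y'), the next observation is 'x'.
hw = HistoryWindow(k=1)
for obs in ['x', 'y', 'x', 'y']:
    hw.update('a', obs)
```

The horizon effect shows up immediately: any regularity spanning more than k steps is invisible to this predictor.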
Best of Both?
Is it possible to have a representation that is grounded in actions and observations and as expressive as POMDPs?
Predictions as State
Idea: key information may come from the distant past, but it is never needed too far into the future.
[Figure: grid example starting at blue, contrasting what the history lets us forget (e.g., up blue, left not red) with what we must predict (e.g., up blue? left red?)]
What’s a Test?
A test t = a1 o1 a2 o2 … al ol (closed loop).
Test outcome prediction: the probability of the observations given the actions,
Pr(o1 o2 … ol | h, a1 a2 … al) = Pr(t | h).
Prediction vector (1 × |Q|) for a set Q of tests:
Pr(Q | h) = (Pr(t1 | h), Pr(t2 | h), …, Pr(t_|Q| | h)), abbreviated p(h).
PSR Definition (Littman, Sutton & Singh 02)
Predictive state representation (PSR): a set Q of tests whose predictions are a sufficient statistic:
Pr(t | h) = f_t(Pr(Q | h)) = f_t(p(h))
Any test's outcome can be found from the prediction vector.
Linear PSR (m_t is 1 × |Q|): Pr(t | h) = f_t(p(h)) = p(h) m_t^T
Recursive Updating
As new information a o arrives, we can update. For each t ∈ Q:
Pr(t | h a o) = Pr(o t | h, a) / Pr(o | h, a)
= f_{aot}(p(h)) / f_{ao}(p(h))
= p(h) m_{aot}^T / p(h) m_{ao}^T (linear PSR)
Matrix form: p(h a o) = p(h) M_{ao}^T / p(h) m_{ao}^T
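The matrix-form update is a few lines, and mirrors the belief-state update earlier in the talk (a sketch; the toy numbers are hypothetical, not derived from a particular POMDP):

```python
import numpy as np

def psr_update(p, M_ao, m_ao):
    """Linear-PSR update p(hao) = p(h) M_ao / p(h) m_ao.
    M_ao stacks the weight vectors m_aot for the core tests, and
    p @ m_ao = Pr(o | h, a) is the normalizer."""
    pr_o = float(p @ m_ao)
    return (p @ M_ao) / pr_o, pr_o

# Toy numbers for a two-test PSR.
p = np.array([0.5, 0.5])
m_ao = np.array([0.6, 0.2])
M_ao = np.array([[0.3, 0.1],
                 [0.1, 0.3]])
p_next, pr_o = psr_update(p, M_ao, m_ao)
```

Unlike a belief state, a prediction vector need not sum to one; its entries are probabilities of distinct tests, each grounded in observable data.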
Connection to POMDPs
PSRs and POMDPs are quite similar; let's connect them. We will show every POMDP has a PSR.
Outcome u(t): the prediction of t from every state (a vector over states).
A test t is independent of a set of tests Q when u(t) is linearly independent of u(Q).
Float/Reset POMDP
Float: random walk. Reset: jump to the far right; observe 1 if already there.
u(R 1) = (0 0 0 0 1)
u(F 0) = (1 1 1 1 1)
u(F 0 R 1) = (0 0 0 0.5 0.5)
u(R 0 R 1) = (1 1 1 1 0)
Linear PSR from POMDP
Given a POMDP representation, how can we pick tests that form a sufficient statistic? The search procedure:
Start with Q containing only the null test.
While there is some t ∈ Q with a one-step extension a o t independent of Q, add a o t to Q.
Else terminate and return Q.
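The search procedure can be sketched as follows. This is a minimal illustration: `search_core_tests`, the toy two-state POMDP, and seeding the search with the null test (whose outcome is the all-ones vector) are assumptions of this sketch:

```python
import numpy as np

def search_core_tests(T, O, n_states):
    """Grow a set Q of core tests: extend tests one (a, o) step at a
    time, keeping an extension whenever its outcome vector u(a o t)
    is linearly independent of u(Q)."""

    def outcome(test):
        # u(t) for t = a1 o1 ... al ol is T^a1 O^o1 ... T^al O^ol 1.
        u = np.ones(n_states)
        for a, o in reversed(test):
            u = T[a] @ O[o] @ u
        return u

    Q, U = [()], [outcome(())]  # seed with the null test
    changed = True
    while changed:
        changed = False
        for t in list(Q):
            for a in range(len(T)):
                for o in range(len(O)):
                    ext = ((a, o),) + t
                    u = outcome(ext)
                    if np.linalg.matrix_rank(np.vstack(U + [u])) > len(U):
                        Q.append(ext)
                        U.append(u)
                        changed = True
    return Q

# Toy two-state POMDP: one action, two observations (diagonal O matrices).
T = [np.array([[0.9, 0.1],
               [0.2, 0.8]])]
O = [np.diag([0.7, 0.1]),
     np.diag([0.3, 0.9])]
Q = search_core_tests(T, O, n_states=2)
```

The rank check is exactly the independence test from the previous slide, so |Q| can never exceed the number of states.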
Properties of search
search terminates: Q contains no more than |S| tests, and no test in Q is longer than |S|.
At termination, all tests are dependent on Q.
M_ao^T = u(Q)^+ T^a O^o u(Q) (the u(Q) factors cancel).
Runs in polynomial time and captures the POMDP exactly.
Float/Reset Updates
[Table: how each test in Q (F 0, F 0 R 0, F 0 F 0 R 0, F 0 F 0 F 0 R 0, …) updates on F 0; one such update is the combination 0.25 R 0 - 0.0625 F 0 R 0 + 0.750 F 0 F 0 R 0]
Interestingly, the two tests R 1 and F 0 R 1 form a (non-linear) PSR; its predictions follow the pattern 1, .5, .5, .375, .375, .3125, .3125, …
Planning in PSRs
V(p) = max_a Σ_o Pr(o | a, p) γ (r(o) + V'(p'))
Let V'(p) = max_α p α^T (piecewise linear and convex).
Recall p' = p M_ao^T / p m_ao^T for a linear PSR.
So V(p) = max_a max_α p (Σ_o γ (m_ao^T r(o) + M_ao^T α^T)) = max_β p β^T
Incremental pruning works; so do sampling and function approximation.
Learning in a Linear PSR
Need to learn m_aot for each test t ∈ Q.
Use the estimates to produce a sequence of prediction vectors p.
Example trajectory: … F 0 F 0 R 0 R 1 F 0 R 1 F 0 F 0 …
m_{R1 F0 R0}: target 0 (importance-weighted)
m_{R1 F0}: target 1 (importance-weighted)
m_{R1 R0}: no target
Updates are simple via a linear update (delta) rule: gradient-like, but no improvement theorem is known.
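The delta rule mentioned here can be sketched as follows (hypothetical `delta_update`; the targets are assumed given, and the importance weighting is omitted for brevity):

```python
import numpy as np

def delta_update(m_t, p, target, lr=0.5):
    """Delta rule: move the prediction p . m_t toward the observed
    (importance-weighted) target for test t."""
    error = target - float(p @ m_t)
    return m_t + lr * error * p

# Toy convergence check: with a fixed prediction vector and target,
# repeated updates drive the prediction to the target.
m = np.zeros(2)
p = np.array([1.0, 0.0])
for _ in range(100):
    m = delta_update(m, p, target=1.0)
pred = float(p @ m)
```

The update is gradient-like in the squared prediction error, consistent with the slide's caveat that no improvement theorem is known for the full learning setting.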
Contributions
Formulated important problems as POMDPs.
Planned via exact and approximate methods; solutions factor in the cost of information.
Predictions are useful as state:
Rivest & Schapire (1987): deterministic environments
Jaeger (1999): stochastic, no actions
this work: stochastic environments with actions
As expressive as POMDPs with belief states, even with linear predictions.
Ongoing Work: PSRs
Learning PSRs (with Singh, Stone, Sutton): does it work? Can TD help? How to choose tests?
Connection between tests and options.
Abstraction via dropping tests.
Key tests useful for state and reward.
Learning in Float-Reset
Value Iteration for POMDPs
Find a set of vectors (linear functions) to represent the value function:
for t = 1 to k: find a minimum covering C_t via the vectors in C_{t-1}
One-pass (Sondik 71): search constrained regions
Exhaustive (Monahan 82; Lark 91): enumerate vectors
Linear support (Cheng 88): enumerate vertices
Witness (Littman, Cassandra, Kaelbling 95): cover by actions
Incremental Pruning (Cassandra, Littman, Zhang 97): enumerate, with periodic removal of redundant vectors