Planning and Learning with Hidden State


1 Planning and Learning with Hidden State
Michael L. Littman, AT&T Labs – Research

2 Outline
The problem of hidden state; POMDPs; planning; learning; predictive state representations.

3 Acting Intelligently…
What was that?

4 …Means Asking for Directions (at least sometimes)
How do you build an agent that can:
take actions to gain information,
reason about what it doesn't know,
represent things it can't see, and
remember what's relevant for decisions?
Planning and acting in complex environments; explicitly reasoning about information.

5 Applications
Distributed systems: what caused that fault?
Agent negotiation, e-commerce: how much does she want that PlayStation?
Graphics, animation, life-like characters: where should the character be looking?
Natural language, dialog: what does the user really want?

6 Navigation Example (Littman & Simpson 98)
Actions: right-hand rule, left-hand rule, stop.
Goal: stop at the star.
Observations: convex, concave, bottleneck, win!

7 Environments: Formal Model
The environment maps an action-observation history and the current action to a probability distribution over the next observation: Pr(o|h,a).

8 One Model: POMDPs
Partially observable Markov decision processes:
finite set of states and actions (i ∈ S, a ∈ A)
transition probability from state i to state j under action a: T^a_ij
finite set of observations (o ∈ O)
observation probability in state i: O^o_ii
reward function: r(o)
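A minimal sketch of this tabular representation in Python/NumPy, used by the later sketches; the container name and layout are my own, not from the talk:

import numpy as np

class POMDP:
    """Tabular POMDP: T[a] is an |S| x |S| matrix with T[a][i, j] = Pr(j | i, a);
    O[o] is a length-|S| vector holding the diagonal of O^o, i.e. O[o][i] = Pr(o | i);
    r[o] is the reward for observing o."""
    def __init__(self, T, O, r, gamma=0.95):
        self.T = {a: np.asarray(m, dtype=float) for a, m in T.items()}
        self.O = {o: np.asarray(v, dtype=float) for o, v in O.items()}
        self.r = dict(r)
        self.gamma = gamma
        self.n_states = next(iter(self.T.values())).shape[0]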

9 POMDP: Belief States
Example history h = left, convex, left, convex, left, concave.
The belief state b(h), a 1×|S| row vector, summarizes the history.

10 Belief State is “State” (Åström 65)
Can represent the environment: Pr(o|h,a) = b(h) T^a O^o e^T.
As new information a, o arrives, we can update: b(hao) = b(h) T^a O^o / Pr(o|h,a).
The belief state is all we need to remember.
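As a sketch, the belief update and Pr(o|h,a) in NumPy, assuming the POMDP container above:

def belief_update(b, a, o, pomdp):
    # b(hao) = b(h) T^a O^o / Pr(o|h,a), where Pr(o|h,a) = b(h) T^a O^o e^T
    unnorm = (b @ pomdp.T[a]) * pomdp.O[o]
    pr_o = unnorm.sum()
    return unnorm / pr_o, pr_o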

11 Planning in POMDPs
Dynamic programming: the value function maps each state to expected future value.
Choose actions to maximize the sum of immediate reward and the expected value of the resulting state.
In POMDPs, the belief state plays the role of state.
[Figure: value function, the maximum total expected reward from belief b.]

12 POMDP Value Functions
The finite-horizon value function has a finite representation: it is piecewise-linear and convex (Sondik 71).
[Figure: value vectors for actions a and b plotted against Pr(i2); animation by Thrun.]
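Such a piecewise-linear convex value function can be stored as a set of vectors (one per conditional plan) and evaluated by a max over dot products; a minimal sketch, with names of my own:

import numpy as np

def value(b, Gamma):
    # V(b) = max over vectors alpha in Gamma of b . alpha
    return max(float(np.dot(b, alpha)) for alpha in Gamma)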

13 Functional DP
V(b) = max_a Σ_o Pr(o|a,b) γ (r(o) + V'(b'))
Represent V'(b) by a set of vectors.
Cross-sum: A ⊕ B = { α + β | α ∈ A, β ∈ B }; the max of two such functions corresponds to the union of their vector sets (sketched below).
Also applies to: time-based MDPs (Boyan & Littman 01), Nash equilibria (Kearns, Littman & Singh 01), knapsack (Csirik, Littman, Singh & Stone 01).
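A sketch of the cross-sum, the set operation that implements addition of the vector-set representations (taking a max of two PWLC functions is just the union of their vector sets):

def cross_sum(A, B):
    # A (+) B = { alpha + beta | alpha in A, beta in B }
    return [alpha + beta for alpha in A for beta in B]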

14 Incremental Pruning
Elegant, simple, fast (Cassandra, Littman & Zhang 97).
Start with the vectors from the previous iteration, C_{t-1}.
For each a and o, compute C_t^{a,o} = { e r(o)/|O| + γ T^a O^o α^T | α ∈ C_{t-1} }.
For each a, compute C_t^a = purge(C_t^{a,o_k} ⊕ … ⊕ purge(C_t^{a,o_2} ⊕ C_t^{a,o_1})).
Let C_t = purge(∪_a C_t^a).
“purge” is done via linear programming (Lark; White 91).
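A sketch of one Incremental Pruning stage built on the POMDP container and cross_sum above. For brevity, purge here only removes pointwise-dominated vectors; the algorithm as published purges with linear programs, which also removes vectors dominated by combinations of others.

from functools import reduce
import numpy as np

def purge(vectors):
    # Keep vectors not pointwise-dominated by another vector (simplified purge).
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(j != i and np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept

def backup_ao(C_prev, pomdp, a, o):
    # C_t^{a,o} = { e r(o)/|O| + gamma T^a O^o alpha | alpha in C_{t-1} }
    const = pomdp.r[o] / len(pomdp.O) * np.ones(pomdp.n_states)
    return [const + pomdp.gamma * (pomdp.T[a] @ (pomdp.O[o] * alpha))
            for alpha in C_prev]

def incremental_pruning_step(C_prev, pomdp):
    C_t = []
    for a in pomdp.T:                                     # C_t^a via interleaved purging
        parts = [purge(backup_ao(C_prev, pomdp, a, o)) for o in pomdp.O]
        C_t += reduce(lambda A, B: purge(cross_sum(A, B)), parts)
    return purge(C_t)                                     # C_t = purge(U_a C_t^a)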

15 Algorithmic Complexity
Incremental Pruning is polynomial in Σ_a |C_t^a|, |O|, |S|, |A|, and the number of bits of precision.
Witness (Kaelbling, Littman & Cassandra 98) was the first algorithm with this guarantee.
We would prefer a bound in terms of |C_t| rather than Σ_a |C_t^a|, but that is impossible unless NP = RP (Littman & Cassandra 96).
Empirical results surpass prior algorithms.

16 Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
States: SR, ST, SM, done; transitions go to done.
Actions: QC (What city?); QR, QT, QM (Going to X?); R, T, M (I think X).
Observations: yes, no (more reliable); R, T, M (T and M confusable).
Objective: reward for the correct classification, cost for asking questions.

17 Incremental Pruning Output
The optimal plan varies with the priors (shown with Pr(SR) = Pr(SM)). [Figure: policy regions over the belief simplex with corners SR, ST, SM.]

18–21 [Figure-only slides (no transcript text).]

22 Sometimes it is best not to ask for directions.

23 Other Approaches
Specialized algorithms for deterministic environments (Littman 96).
Finding good memoryless policies (Littman 94).
Approximation via RL (Littman, Cassandra & Kaelbling 95).
Structured state spaces: expressive equivalence (Littman 97); complexity (Littman, Goldsmith & Mundhenk 98).
Planning via stochastic satisfiability (Majercik & Littman 97, 99).

24 Learning a Model
The agent tries to learn to predict how the environment will react: it builds a model. The model is then used to plan: “indirect control.”
experience → learner → model
model + experience → tracker → state
model + state → planner → action decisions

25 Learning a POMDP
Input: a history (action-observation sequence).
Output: a POMDP that “explains” the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
E-step: forward-backward over the POMDP model gives state-occupation probabilities.
M-step: fractional counting.
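A sketch of the E-step (unscaled forward-backward over one history) using the POMDP container above; real implementations rescale to avoid underflow, and the M-step then re-estimates T and O by fractionally counting transitions and observations under these posteriors.

import numpy as np

def e_step(pomdp, actions, observations, b0):
    # Returns post[t, i] = Pr(state at step t is i | whole history).
    n, S = len(actions), pomdp.n_states
    alpha = np.zeros((n, S))              # forward: joint of prefix and state
    beta = np.ones((n, S))                # backward: likelihood of suffix
    prev = np.asarray(b0, dtype=float)
    for t, (a, o) in enumerate(zip(actions, observations)):
        alpha[t] = (prev @ pomdp.T[a]) * pomdp.O[o]
        prev = alpha[t]
    for t in range(n - 2, -1, -1):
        a, o = actions[t + 1], observations[t + 1]
        beta[t] = pomdp.T[a] @ (pomdp.O[o] * beta[t + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)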

26 EM Pitfalls
Each iteration increases the data likelihood, but there are local maxima (Shatkay & Kaelbling 97; Nikovski 99), and EM rarely learns a good model.
Hidden states are truly unobservable. Can we ground a model in data?

27 History Window
Base predictions on the most recent k actions and observations.
Pros: easy to update (scroll the window); easy to learn (keep counts).
Cons: the “horizon effect” (only an approximation of the environment); the state space grows exponentially with k.
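A small sketch of a k-step history-window predictor with counts (class and method names are my own):

from collections import defaultdict

class HistoryWindow:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))  # (window, action) -> obs -> count

    def update(self, history, action, next_obs):
        window = tuple(history[-self.k:])          # "scroll the window"
        self.counts[(window, action)][next_obs] += 1

    def predict(self, history, action):
        c = self.counts[(tuple(history[-self.k:]), action)]
        total = sum(c.values())
        return {o: n / total for o, n in c.items()} if total else {}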

28 Best of Both?
Is it possible to have a representation that is grounded in actions and observations and as expressive as POMDPs?

29 Predictions as State
Idea: keep the key information from the distant past, but never look too far into the future.
[Grid-world example: start at blue; down → red; (left → red) an odd number of times; up → __? The history says what to forget; predictions such as “up → blue?” and “left → red?” say what to expect.]

30 What’s a Test?
Test: t = a_1 o_1 a_2 o_2 … a_l o_l (closed loop).
Test outcome prediction: the probability of the observations given the actions, Pr(o_1 o_2 … o_l | h, a_1 a_2 … a_l) = Pr(t|h).
Prediction vector (1×|Q|) for a set Q of tests: Pr(Q|h) = (Pr(t_1|h), Pr(t_2|h), …, Pr(t_|Q||h)), abbreviated p(h).

31 PSR Definition (Littman, Sutton & Singh 02)
Predictive state representation (PSR): a set Q of tests whose predictions are a sufficient statistic: Pr(t|h) = f_t(Pr(Q|h)) = f_t(p(h)).
Any test's outcome can be found from the prediction vector.
Linear PSR (m_t is 1×|Q|): Pr(t|h) = f_t(p(h)) = p(h) m_t^T.

32 Recursive Updating
As new information a, o arrives, we can update, for each t in Q:
Pr(t | h a o) = Pr(o t | h, a) / Pr(o | h, a) = f_{aot}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot}^T / p(h) m_{ao}^T   (linear PSR)
Matrix form: p(hao) = p(h) M_{ao}^T / p(h) m_{ao}^T.
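The matrix-form update in NumPy; a sketch where M[(a, o)] is assumed to be the |Q|×|Q| matrix whose row for test t is m_aot, and m[(a, o)] the vector for the one-step test a o:

import numpy as np

def psr_update(p, a, o, M, m):
    # p(hao) = p(h) M_ao^T / p(h) m_ao^T;  the denominator is Pr(o | h, a)
    pr_o = float(p @ m[(a, o)])
    return (p @ M[(a, o)].T) / pr_o, pr_o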

33 Connection to POMDPs
POMDPs and PSRs are quite similar; let’s connect them. We will show that every POMDP has a PSR.
Outcome u(t): the prediction of t from each state (an |S|-vector).
A test t is independent of a set of tests Q if u(t) is linearly independent of the vectors u(Q).

34 Float/Reset POMDP
Float (F): random walk. Reset (R): go to the far-right state; observe 1 if already there.
[Table: outcome vectors u(R 1), u(F 0), u(F 0 R 1), u(R 0 R 1); values omitted in the transcript.]

35 Linear PSR from POMDP
Given a POMDP representation, how can we pick tests that form a sufficient statistic? The search procedure:
Start with Q = {}.
While some one-step test a o, or some extension a o t of a test t in Q, has an outcome vector linearly independent of u(Q), add it to Q.
Otherwise terminate and return Q.
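A sketch of search on top of the earlier POMDP container. It uses the recursion u(a o t) = T^a O^o u(t), starting from the null test (whose outcome vector is all ones), and adds an extension whenever its outcome vector raises the rank of u(Q):

import numpy as np

def find_core_tests(pomdp):
    S = pomdp.n_states
    Q, U = [], []                                   # tests and their outcome vectors
    frontier = [((), np.ones(S))]                   # null test: u = all ones
    while frontier:
        t, u_t = frontier.pop()
        for a in pomdp.T:
            for o in pomdp.O:
                u_new = pomdp.T[a] @ (pomdp.O[o] * u_t)      # u(a o t)
                if np.linalg.matrix_rank(np.vstack(U + [u_new])) > len(Q):
                    Q.append(((a, o),) + t)                  # prepend a o to test t
                    U.append(u_new)
                    frontier.append((((a, o),) + t, u_new))
    return Q, np.array(U)                           # len(Q) <= |S| by construction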

36 Properties of search
search terminates: Q contains no more than |S| tests, and no test in Q is longer than |S|.
All tests are dependent on Q.
M_{ao}^T = u(Q)^+ T^a O^o u(Q) (the u(Q) factors cancel in the update).
Runs in polynomial time and captures the POMDP.

37 Float/Reset Updates
[Table: how the tests in Q update after taking F and observing 0.]
Interestingly, the two tests R 1 and F 0 R 1 form a (non-linear) PSR for this domain.

38 Planning in PSRs
V(p) = max_a Σ_o Pr(o|a,p) γ (r(o) + V'(p'))
Let V'(p) = max_α p α^T (piecewise linear and convex).
Recall p' = p M_{ao}^T / p m_{ao}^T (linear PSR).
So V(p) = max_a max_α p (Σ_o γ (m_{ao}^T r(o) + M_{ao}^T α^T)) = max_β p β^T.
Incremental pruning works; sampling or function approximation do, too.

39 Learning in a Linear PSR
Need to learn m_{aot} for each test t in Q. Use the current estimates to produce a sequence of prediction vectors p.
Example data: … F 0 F 0 R 0 R 1 F 0 R 1 F 0 F 0 …
m_{R1 F0 R0} ← 0 (IS-weighted); m_{R1 F0} ← 1 (IS-weighted); m_{R1 R0} ← no target.
Simple via a linear update rule (the delta rule): gradient-like, but no improvement theorem is known.
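A minimal sketch of the delta-rule update for one parameter vector m_aot, with a learning-rate and importance-weight argument of my own (the talk does not specify them):

import numpy as np

def delta_update(m_aot, p, outcome, is_weight=1.0, lr=0.1):
    # Move the prediction p . m_aot toward the (importance-weighted) 0/1 target.
    error = outcome - float(p @ m_aot)
    return m_aot + lr * is_weight * error * p

An update is applied only when the data actually execute the extension test from the current history; otherwise there is no target, as in the m_{R1 R0} case above.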

40 Contributions
Formulated important problems as POMDPs; planned via exact and approximate methods; solutions factor in the cost of information.
Predictions are useful as state:
Rivest & Schapire (1987): deterministic environments.
Jaeger (1999): stochastic, no actions.
This work: stochastic environments with actions.
As expressive as POMDPs with belief states, even with linear predictions.

41 Ongoing Work: PSRs
Learning PSRs (with Singh, Stone, Sutton): does it work? Can TD help? How should tests be chosen?
Connection between tests and options.
Abstraction via dropping tests.
Key tests useful for state and reward.

42 Learning in Float-Reset
[Figure-only slide.]

43 Value Iteration for POMDPs
Find a set of vectors (linear functions) to represent the value function: for t = 1 to k, find a minimum covering C_t using the vectors in C_{t-1}.
One-pass (Sondik 71): search constrained regions.
Exhaustive (Monahan 82; Lark 91): enumerate vectors.
Linear support (Cheng 88): enumerate vertices.
Witness (Littman, Cassandra & Kaelbling 95): cover by actions.
Incremental Pruning (Cassandra, Littman & Zhang 97): enumerate, with periodic removal of redundant vectors.

