Planning and Learning with Hidden State
Michael L. Littman, AT&T Labs Research
Outline
The problem of hidden state
POMDPs
Planning
Learning
Predictive state representation
9/22/2018 Planning and Learning with Hidden State
Acting Intelligently…
What was that?
…Means Asking for Directions (at least sometimes)
How do you build an agent that can…
take actions to gain information
reason about what it doesn't know
represent things it can't see
remember what's relevant for decisions
Planning & acting in complex environments
Explicitly reasoning about information
Applications
Distributed systems: What caused that fault?
Agent negotiation, e-commerce: How much does she want that PlayStation?
Graphics, animation, life-like characters: Where should the character be looking?
Natural language, dialog: What does the user really want?
Navigation Example (Littman & Simpson 98)
Actions: right-hand rule, left-hand rule, stop
Goal: stop at star
Observations: convex, concave, bottleneck, win!
Environments: Formal Model
The environment maps action-observation histories and the current action to a probability distribution over the next observation: Pr(o|h,a).
One Model: POMDPs
Partially observable Markov decision processes:
finite sets of states and actions (i in S, a in A)
transition probability from state i to state j under action a: Taij
finite set of observations (o in O)
observation probability in state i: Ooii
reward function: r(o)
POMDP: Belief States
Example history h = left convex, left convex, left concave.
Belief state b(h) (1×|S|): summarizes history.
Belief State is “State” (Åström 65)
Can represent the environment: Pr(o|h,a) = b(h) Ta Oo eT
As new information a o arrives, can update: b(h a o) = b(h) Ta Oo / Pr(o|h,a)
The belief state is all we need to remember.
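The update rule above is easy to implement directly; below is a minimal sketch, assuming a hypothetical two-state POMDP whose transition matrix Ta and diagonal observation matrix Oo are made-up numbers, not from the talk:

```python
# Belief-state update: b(h a o) = b(h) Ta Oo / Pr(o|h,a).
# Tiny hypothetical 2-state POMDP; the matrices are illustrative only.

def mat_vec(b, M):
    """Row vector b times matrix M."""
    return [sum(b[i] * M[i][j] for i in range(len(b))) for j in range(len(M[0]))]

def belief_update(b, Ta, Oo):
    """Return (new belief, Pr(o|h,a)) after taking action a and observing o."""
    unnorm = mat_vec(b, Ta)                                       # predict: b Ta
    unnorm = [unnorm[j] * Oo[j][j] for j in range(len(unnorm))]   # correct: times diagonal Oo
    pr_o = sum(unnorm)                                            # Pr(o|h,a) = b Ta Oo e^T
    return [x / pr_o for x in unnorm], pr_o

Ta = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical transitions for action a
Oo = [[0.7, 0.0], [0.0, 0.4]]   # hypothetical observation probabilities for o
b = [0.5, 0.5]
b2, pr_o = belief_update(b, Ta, Oo)
print(b2, pr_o)
```

The normalizing constant is exactly Pr(o|h,a), so tracking the belief and predicting observations use the same computation.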
Planning in POMDPs
Dynamic programming: the value function maps state to expected future value. Choose actions to maximize the sum of reward and the expected value of the resulting state. In POMDPs, the belief state plays the role of state.
(Figure: value function = maximum total expected reward, plotted as a function of belief b.)
POMDP Value Functions
(Figure, animation by Thrun: value functions for actions a and b plotted against Pr(i2).)
Value function (finite horizon): finite representation. Piecewise-linear and convex (Sondik 71).
Functional DP
V(b) = maxa Σo Pr(o|a,b) γ(r(o) + V′(b′))
Represent V′(b) by a set of vectors.
Cross sum A + B = { a + b | a in A, b in B }; max = ∪.
Also: time-based MDPs (Boyan & Littman 01), Nash equilibria (Kearns, Littman & Singh 01), knapsack (Csirik, Littman, Singh & Stone 01).
Incremental Pruning
Elegant, simple, fast (Cassandra, Littman & Zhang 97).
Start with the vectors from the previous iteration, Ct-1.
For each a and o, compute Cta,o = { eT r(o)/k + γ Ta Oo αT | α in Ct-1 }.
For each a, compute Cta = purge(Cta,ok + … + purge(Cta,o2 + Cta,o1)), where + is the cross sum.
Let Ct = purge(∪a Cta).
“purge” via linear programming (Lark; White 91)
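The two core operations, cross sum and purge, can be sketched as follows. For brevity this sketch purges only pointwise-dominated vectors, which is weaker than the linear-programming purge the algorithm actually uses, and the vector sets are hypothetical:

```python
# Sketch of incremental pruning's inner steps: cross sum, then purge.
# Pointwise dominance is a simplification of the LP-based purge (Lark; White 91).

def cross_sum(A, B):
    """A + B = { a + b | a in A, b in B } (componentwise vector sums)."""
    return [tuple(x + y for x, y in zip(a, b)) for a in A for b in B]

def purge(vectors):
    """Drop vectors pointwise-dominated by some other vector."""
    kept = []
    for v in vectors:
        dominated = any(all(w[i] >= v[i] for i in range(len(v))) and w != v
                        for w in vectors if w is not v)
        if not dominated:
            kept.append(v)
    return kept

A = [(1.0, 0.0), (0.0, 1.0)]   # hypothetical vector sets over a 2-state belief
B = [(0.5, 0.5), (0.0, 0.0)]
C = purge(cross_sum(A, B))
print(C)
```

Interleaving purge with each cross sum, as on the slide, keeps the intermediate sets small.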
Algorithmic Complexity
Incremental Pruning: polynomial in Σa |Cta|, |O|, |S|, |A|, and the number of bits of precision.
Witness (Kaelbling, Littman & Cassandra 98) came first.
Would prefer polynomial in |Ct| rather than Σa |Cta|; impossible unless NP = RP (Littman & Cassandra 96).
Empirical results surpass prior algorithms.
Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
States: SR, ST, SM, done. Transitions to done.
Actions: QC (What city?); QR, QT, QM (Going to X?); R, T, M (I think X).
Observations: yes, no (more reliable); R, T, M (T and M confusable).
Objective: reward for correct classification, cost for questions.
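The slide gives only the structure of this model; below is a sketch of the specification, with placeholder probabilities and rewards that are purely illustrative:

```python
# Hypothetical specification of the classification-dialog POMDP.
# The numbers are illustrative placeholders; only the structure is from the slide.

states = ["SR", "ST", "SM", "done"]
actions = ["QC", "QR", "QT", "QM", "R", "T", "M"]
observations = ["yes", "no", "R", "T", "M"]

# Observation model for the open question QC: T and M are confusable.
obs_given_state_QC = {
    "SR": {"R": 0.9, "T": 0.05, "M": 0.05},
    "ST": {"R": 0.1, "T": 0.6, "M": 0.3},   # T often misheard as M
    "SM": {"R": 0.1, "T": 0.3, "M": 0.6},   # and vice versa
}

def reward(state, action):
    """Cost for questions; reward for correct classification, penalty otherwise."""
    if action in ("QC", "QR", "QT", "QM"):
        return -1.0
    return 10.0 if ("S" + action) == state else -10.0

for dist in obs_given_state_QC.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # each row is a distribution
print(reward("SR", "R"), reward("SR", "T"), reward("SR", "QC"))
```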
Incremental Pruning Output
(Figure: optimal plan over the belief simplex with corners SR, ST, SM.) The optimal plan varies with the priors (here SR = SM).
Sometimes it is best not to ask for directions.
Other Approaches
Specialized algorithms for deterministic environments (Littman 96)
Finding good memoryless policies (Littman 94)
Approximation via RL (Littman, Cassandra & Kaelbling 95)
Structured state spaces: expressive equivalence (Littman 97); complexity (Littman, Goldsmith & Mundhenk 98)
Planning via stochastic satisfiability (Majercik & Littman 97, 99)
Learning a Model
The agent tries to learn to predict how the environment will react: it builds a model. The model is then used to plan: “indirect control”.
experience → learner → model
model + experience → tracker → state
model + state → planner → action decisions
Learning a POMDP
Input: a history (action-observation sequence).
Output: a POMDP that “explains” the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
E step: forward-backward on the POMDP model yields state-occupation probabilities.
M step: fractional counting.
EM Pitfalls
Each iteration increases the data likelihood, but local maxima remain (Shatkay & Kaelbling 97; Nikovski 99), so EM rarely learns a good model.
Hidden states are truly unobservable.
Can we ground a model in data?
History Window
Base predictions on the most recent k actions and observations.
Pros: easy to update (scroll the window); easy to learn (keep counts).
Cons: “horizon effect” (only an approximation of the environment); state space grows exponentially with k.
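A k-step history window can be sketched with plain counting; the trace below is a hypothetical action-observation sequence in the style of the navigation example:

```python
# History-window model: predict the next observation from the last k
# action-observation pairs, learned by keeping counts.
from collections import defaultdict

class WindowModel:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def learn(self, window, action, obs):
        self.counts[(tuple(window), action)][obs] += 1

    def predict(self, window, action):
        c = self.counts[(tuple(window), action)]
        total = sum(c.values())
        return {o: n / total for o, n in c.items()} if total else {}

model = WindowModel(k=1)
trace = [("left", "convex"), ("left", "convex"), ("left", "concave")]
window = []
for action, obs in trace:
    model.learn(window[-model.k:], action, obs)   # scroll: keep only the last k pairs
    window.append((action, obs))

print(model.predict([("left", "convex")], "left"))
```

The horizon effect is visible even here: after “left convex” the model cannot distinguish the two underlying situations, so it splits its prediction.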
Best of Both?
Is it possible to have a representation that is grounded in actions and observations and as expressive as POMDPs?
Predictions as State
Idea: the key information comes from the distant past, but never lies too far in the future.
(Worked example on slide: given a history, forget what is irrelevant and carry predictions such as “up blue?” and “left red?” as state.)
What’s a Test?
Test: t = a1 o1 a2 o2 … al ol (closed loop).
Test outcome prediction: the probability of the observations given the actions: Pr(o1 o2 … ol | h, a1 a2 … al) = Pr(t|h).
Prediction vector (1×|Q|) for a set Q of tests: Pr(Q|h) = (Pr(t1|h), Pr(t2|h), …, Pr(t|Q| |h)), abbreviated p(h).
PSR Definition (Littman, Sutton & Singh 02)
Predictive state representation (PSR): a set Q of tests whose predictions are a sufficient statistic: Pr(t|h) = ft(Pr(Q|h)) = ft(p(h)).
Any test's outcome can be found from the prediction vector.
Linear PSR (mt is 1×|Q|): Pr(t|h) = ft(p(h)) = p(h) mtT.
Recursive Updating
As new information a o arrives, update each t in Q:
Pr(t|h a o) = Pr(o t|h a) / Pr(o|h a) = faot(p(h)) / fao(p(h)) = p(h) maotT / p(h) maoT (linear PSR).
Matrix form: p(h a o) = p(h) MaoT / p(h) maoT.
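The matrix form of the linear-PSR update can be sketched directly; the projection vector mao and matrix Mao below are hypothetical numbers for a two-test PSR, not taken from the talk:

```python
# p(h a o) = p(h) Mao^T / p(h) mao^T for a linear PSR.
# mao and Mao are hypothetical; row i of Mao holds the weights for test a o t_i.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def psr_update(p, M_ao, m_ao):
    """Update the prediction vector after taking a and observing o."""
    pr_o = dot(p, m_ao)                     # Pr(o|h,a) = p(h) mao^T
    numer = [dot(p, row) for row in M_ao]   # p(h) Mao^T, one entry per test in Q
    return [x / pr_o for x in numer], pr_o

p = [0.6, 0.3]                      # current predictions for the two tests in Q
m_ao = [0.5, 0.5]                   # hypothetical weights for the one-step test a o
M_ao = [[0.4, 0.2], [0.1, 0.6]]     # hypothetical weights for the extended tests
p2, pr_o = psr_update(p, M_ao, m_ao)
print(p2, pr_o)
```

The normalizer is again exactly Pr(o|h,a), mirroring the belief-state update.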
Connection to POMDPs
Quite similar; let's connect them. We will show every POMDP has a PSR.
Outcome u(t): the prediction of t from each state (a vector over S).
A test t is independent of a set of tests Q if u(t) is linearly independent of u(Q).
Float/Reset POMDP
Float: random walk. Reset: jump to the far right; observe 1 if already there.
Outcome vectors shown on the slide: u(R 1), u(F 0), u(F 0 R 1), u(R 0 R 1).
Linear PSR from POMDP
Given a POMDP representation, how can we pick tests that form a sufficient statistic? search:
Start with Q = {}.
While there is some t ∈ Q such that some a o t is independent of Q, add a o t to Q.
Otherwise terminate and return Q.
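The heart of search is the linear-independence check on outcome vectors; a sketch using Gaussian elimination, with hypothetical outcome vectors over a three-state POMDP:

```python
# search's core test: add a o t only if its outcome vector u(a o t) is
# linearly independent of u(Q). Rank computed by Gaussian elimination.

def rank(rows, eps=1e-9):
    rows = [list(r) for r in rows]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if abs(rows[i][c]) > eps), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and abs(rows[i][c]) > eps:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r

def independent(u_new, u_Q):
    return rank(u_Q + [u_new]) > rank(u_Q)

u_Q = [[1.0, 0.0, 0.5]]                          # hypothetical outcome vectors
assert independent([0.0, 1.0, 0.5], u_Q)         # new direction: add the test
u_Q.append([0.0, 1.0, 0.5])
assert not independent([0.5, 0.5, 0.5], u_Q)     # linear combination: skip it
print(rank(u_Q))
```

Since rank can never exceed |S|, this check also explains why search keeps at most |S| tests.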
Properties of search
search terminates: Q contains at most |S| tests, no test in Q is longer than |S|, and all other tests are dependent on Q.
MaoT = u(Q)+ Ta Oo u(Q) (the u(Q)s cancel).
Runs in polynomial time; captures the POMDP.
Float/Reset Updates
(Slide traces how the tests in Q update along sequences of F 0 and R 0 steps.)
Interestingly: the tests R 1 and F 0 R 1 give a (non-linear) PSR …
Planning in PSRs
V(p) = maxa Σo Pr(o|a,p) γ(r(o) + V′(p′))
Let V′(p) = maxα p αT (piecewise linear and convex).
Recall p′ = p MaoT / p maoT: linear in p.
So V(p) = maxa maxα p (Σo γ(maoT r(o) + MaoT αT)) = maxβ p βT.
Incremental pruning works; so do sampling and function approximation.
Learning in a Linear PSR
Need to learn maot for each test t in Q. Use the estimates to produce a sequence of prediction vectors p.
(Slide example: along the trajectory … F 0 F 0 R 0 R 1 F 0 R 1 F 0 F 0 …, targets are 0 or 1, importance-sampling weighted; some steps give no target.)
Simple via a linear update rule (delta rule): gradient-like, but no improvement theorem is known.
Contributions
Formulate important problems as POMDPs.
Plan via exact and approximate methods; solutions factor in the cost of information.
Predictions are useful as state:
Rivest & Schapire (1987): deterministic environments
Jaeger (1999): stochastic, no actions
here: stochastic environments with actions
As expressive as POMDPs with belief states, even with linear predictions.
Ongoing Work: PSRs
Learning PSRs (with Singh, Stone, Sutton): Does it work? Can TD help? How to choose tests?
Connection between tests and options.
Abstraction via dropping tests.
Key tests useful for state and reward.
Learning in Float-Reset
Value Iteration for POMDPs
Find a set of vectors (linear functions) to represent the value function.
for t = 1 to k: find a minimum covering Ct via vectors in Ct-1
One-pass (Sondik 71): search constrained regions.
Exhaustive (Monahan 82; Lark 91): enumerate vectors.
Linear support (Cheng 88): enumerate vertices.
Witness (Littman, Cassandra, Kaelbling 95): cover by actions.
Incremental Pruning (Cassandra, Littman, Zhang 97): enumerate, with periodic removal of redundant vectors.