Planning and Learning with Hidden State

Planning and Learning with Hidden State
Michael L. Littman, AT&T Labs Research

Outline
- The problem of hidden state
- POMDPs
- Planning
- Learning
- Predictive state representations

Acting Intelligently…
What was that?

…Means Asking for Directions (at least sometimes)
How do you build an agent that can…
- take actions to gain information?
- reason about what it doesn't know?
- represent things it can't see?
- remember what's relevant for decisions?
Planning and acting in complex environments
Explicitly reasoning about information

Applications
- Distributed systems: What caused that fault?
- Agent negotiation, e-commerce: How much does she want that Playstation?
- Graphics, animation, life-like characters: Where should the character be looking?
- Natural language, dialog: What does the user really want?

Navigation Example (Littman & Simpson 98)
- Actions: right-hand rule, left-hand rule, stop
- Goal: stop at the star
- Observations: convex, concave, bottleneck, win!

Environments: Formal Model
The environment maps an action-observation history h and the current action a to a probability distribution over the next observation: Pr(o | h, a).
[Figure: agent-environment loop exchanging actions and observations]

One Model: POMDPs
Partially observable Markov decision processes:
- finite sets of states and actions (i ∈ S, a ∈ A)
- transition probability from state i to state j under action a: T^a_ij
- finite set of observations (o ∈ O)
- observation probability in state i: O^o_ii (a diagonal observation matrix per observation)
- reward function: r(o)
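For concreteness, the model above can be held in a small data structure; this is a minimal sketch (the class and field names are hypothetical, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class TabularPOMDP:
    """Tabular POMDP in the slide's notation.

    T[a] : |S| x |S| matrix, T[a][i, j] = T^a_ij (transition probability)
    O[o] : |S| x |S| diagonal matrix, O[o][i, i] = Pr(observe o in state i)
    r[o] : reward received when observation o is seen
    """
    T: dict
    O: dict
    r: dict

    def n_states(self) -> int:
        # Any transition matrix reveals |S|.
        return next(iter(self.T.values())).shape[0]
```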

POMDP: Belief States
Example history: h = left convex, left convex, left concave
The belief state b(h), a 1 x |S| probability vector, summarizes the history.

Belief State is "State" (Åström 65)
Can represent the environment: Pr(o | h, a) = b(h) T^a O^o e^T
As new information a, o arrives, can update: b(h a o) = b(h) T^a O^o / Pr(o | h, a)
The belief state is all we need to remember.
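As an executable illustration of the two formulas above, here is a minimal NumPy sketch (the function name and the dictionary-of-matrices layout follow the sketch after the POMDP slide and are assumptions, not the talk's code):

```python
import numpy as np

def belief_update(b, T, O, a, o):
    """Compute Pr(o | h, a) and b(h a o) from b(h).

    b    : 1 x |S| belief vector for history h
    T[a] : |S| x |S| transition matrix for action a
    O[o] : |S| x |S| diagonal observation matrix for observation o
    """
    unnorm = b @ T[a] @ O[o]      # b(h) T^a O^o
    pr_o = unnorm.sum()           # Pr(o | h, a) = b(h) T^a O^o e^T
    return unnorm / pr_o, pr_o    # b(h a o) and the observation probability
```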

Planning in POMDPs
- Dynamic programming: a value function maps each state to its expected future value; choose actions to maximize the sum of immediate reward and the expected value of the resulting state.
- In POMDPs, the belief state must play the role of the state.
[Figure: value function over beliefs b, giving the maximum total expected reward from b]

POMDP Value Functions
[Figure: value function plotted over Pr(i2), built from linear pieces for actions a and b; animation by Thrun]
A finite-horizon value function has a finite representation: it is piecewise-linear and convex (Sondik 71).

Functional DP
V(b) = max_a Σ_o Pr(o | a, b) γ (r(o) + V'(b'))
Represent V'(b) by a set of vectors; the cross-sum is A ⊕ B = { α + β | α ∈ A, β ∈ B }, and taking a max corresponds to taking the union of vector sets.
Also: time-based MDPs (Boyan & Littman 01), Nash equilibria (Kearns, Littman & Singh 01), knapsack (Csirik, Littman, Singh & Stone 01).

Incremental Pruning
Elegant, simple, fast (Cassandra, Littman & Zhang 97).
- Start with the vectors C_{t-1} from the previous iteration.
- For each a and o, compute C_t^{a,o} = { e^T r(o)/k + γ T^a O^o α^T | α ∈ C_{t-1} }
- For each a, compute C_t^a = purge(C_t^{a,o_k} ⊕ … purge(C_t^{a,o_2} ⊕ C_t^{a,o_1}))
- Let C_t = purge(∪_a C_t^a)
"purge" removes useless vectors via linear programming (Lark; White 91).
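The cross-sum and purge operations can be sketched as below. Note the purge here is a simplified pointwise-dominance filter, not the LP-based purge (Lark; White 91) the slide refers to, so it may retain some vectors the exact method would remove; all names are hypothetical.

```python
import numpy as np

def cross_sum(A, B):
    """Cross-sum of two sets of value vectors: { a + b | a in A, b in B }."""
    return [a + b for a in A for b in B]

def purge_pointwise(vectors):
    """Drop vectors that are pointwise dominated by some other vector."""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            j != i and np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors)
        )
        if not dominated:
            kept.append(v)
    return kept
```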

Algorithmic Complexity
- Incremental Pruning runs in time polynomial in Σ_a |C_t^a|, |O|, |S|, |A|, and the number of bits of precision.
- Witness (Kaelbling, Littman & Cassandra 98) was the first algorithm with this property.
- We would prefer a bound in terms of |C_t| rather than Σ_a |C_t^a|, but that is impossible unless NP = RP (Littman & Cassandra 96).
- Empirical results surpass prior algorithms.

Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
- States: SR, ST, SM, done; transitions lead to done.
- Actions: QC (What city?), QR, QT, QM (Going to X?), R, T, M (I think X).
- Observations: yes, no (more reliable), R, T, M (T and M confusable).
- Objective: reward for a correct classification, cost for asking questions.

Incremental Pruning Output
[Figure: policy regions over the belief simplex with corners SR, ST, SM]
The optimal plan varies with the priors (shown for SR = SM).

Sometimes it is best not to ask for directions.

Other Approaches
- Specialized algorithms for deterministic environments (Littman 96)
- Finding good memoryless policies (Littman 94)
- Approximation via RL (Littman, Cassandra & Kaelbling 95)
- Structured state spaces: expressive equivalence (Littman 97), complexity (Littman, Goldsmith & Mundhenk 98)
- Planning via stochastic satisfiability (Majercik & Littman 97, 99)

Learning a Model
The agent tries to learn to predict how the environment will react: it builds a model. The model is then used to plan ("indirect control"):
- experience → learner → model
- model + experience → tracker → state
- model + state → planner → action decisions

Learning a POMDP
- Input: a history (action-observation sequence).
- Output: a POMDP that "explains" the data.
- EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
  - E step: forward-backward with the current POMDP model gives state-occupation probabilities.
  - M step: fractional counting re-estimates the model parameters.
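A compact sketch of the E step (forward-backward) for one action-observation trajectory, using the diagonal-observation-matrix layout from earlier; the per-step normalization and the function name are assumptions added for numerical convenience, not the talk's implementation:

```python
import numpy as np

def e_step_occupancy(b0, T, O, actions, observations):
    """Return an L x |S| array of state-occupation probabilities.

    b0           : initial state distribution, shape (|S|,)
    T[a]         : |S| x |S| transition matrix for action a
    O[o]         : |S| x |S| diagonal observation matrix
    actions      : a_1 ... a_L; observations : o_1 ... o_L (o_t follows a_t)
    """
    L, S = len(actions), len(b0)
    alpha, beta = np.zeros((L, S)), np.ones((L, S))
    prev = b0
    for t, (a, o) in enumerate(zip(actions, observations)):
        alpha[t] = prev @ T[a] @ O[o]        # forward message
        alpha[t] /= alpha[t].sum()           # normalize for stability
        prev = alpha[t]
    for t in range(L - 2, -1, -1):
        a, o = actions[t + 1], observations[t + 1]
        beta[t] = T[a] @ O[o] @ beta[t + 1]  # backward message
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                     # occupancy ∝ forward * backward
    return gamma / gamma.sum(axis=1, keepdims=True)
```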

EM Pitfalls
- Each iteration increases the data likelihood, but EM gets stuck in local maxima (Shatkay & Kaelbling 97; Nikovski 99) and rarely learns a good model.
- Hidden states are truly unobservable.
- Can we ground a model in data?

History Window
Base predictions on the most recent k actions and observations.
- Pros: easy to update (scroll the window); easy to learn (keep counts).
- Cons: the "horizon effect" makes it only an approximation of the environment; the state space grows exponentially with k.
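A count-based history-window predictor, as described above, could look like this minimal sketch (class and method names are hypothetical):

```python
from collections import defaultdict

class HistoryWindowModel:
    """Predict the next observation from the last k (action, observation) pairs."""

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def _window(self, history):
        # The "state" is just the most recent k action-observation pairs.
        return tuple(history[-self.k:])

    def update(self, history, action, observation):
        self.counts[(self._window(history), action)][observation] += 1

    def predict(self, history, action):
        obs_counts = self.counts[(self._window(history), action)]
        total = sum(obs_counts.values())
        return {o: c / total for o, c in obs_counts.items()} if total else {}
```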

Best of Both?
Is it possible to have a representation that is grounded in actions and observations and as expressive as POMDPs?

Predictions as State
Idea: key information may come from the distant past, but it can be captured by predictions that never look too far into the future.
[Figure: grid-world example. Start at blue: down gives red, an odd number of lefts gives red, up gives __? Rather than remembering the history, the agent maintains predictions such as "up → blue?" and "left → red?".]

What's a Test?
- Test: t = a_1 o_1 a_2 o_2 … a_l o_l (closed loop)
- Test outcome prediction: the probability of the observations given the actions, Pr(o_1 o_2 … o_l | h, a_1 a_2 … a_l) = Pr(t | h)
- Prediction vector (1 x |Q|) for a set Q of tests: Pr(Q | h) = (Pr(t_1 | h), Pr(t_2 | h), …, Pr(t_|Q| | h)), abbreviated p(h)

PSR Definition (Littman, Sutton & Singh 02)
Predictive state representation (PSR): a set Q of tests whose predictions are a sufficient statistic:
Pr(t | h) = f_t(Pr(Q | h)) = f_t(p(h))
Any test outcome can be found from the prediction vector.
Linear PSR (m_t is 1 x |Q|): Pr(t | h) = f_t(p(h)) = p(h) m_t^T

Recursive Updating
As new information a, o arrives, can update. For each t in Q:
Pr(t | h a o) = Pr(o t | h a) / Pr(o | h a) = f_{aot}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot}^T / p(h) m_{ao}^T (linear PSR)
Matrix form: p(h a o) = p(h) M_{ao}^T / p(h) m_{ao}^T
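In code, the matrix-form update is one line of linear algebra; a minimal NumPy sketch (the (a, o)-indexed dictionaries are an assumed layout, not the talk's code):

```python
import numpy as np

def psr_update(p, M, m, a, o):
    """Linear PSR update: p(h a o) = p(h) M_ao^T / p(h) m_ao^T.

    p       : 1 x |Q| prediction vector for the current history
    M[a, o] : |Q| x |Q| matrix whose row for core test t is m_{aot}
    m[a, o] : length-|Q| vector m_{ao}
    """
    pr_o = float(p @ m[a, o])        # Pr(o | h, a)
    return (p @ M[a, o].T) / pr_o    # updated core-test predictions
```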

Connection to POMDPs
Quite similar; let's connect them. We will show every POMDP has a PSR.
- Outcome u(t): the prediction of t from every state (a vector over states).
- A test t is independent of a set of tests Q if u(t) is linearly independent of the vectors u(Q).

Float/Reset POMDP
- Float: random walk. Reset: go to the far-right state; observe 1 only if already there.
- u(R 1) = (0, 0, 0, 0, 1)
- u(F 0) = (1, 1, 1, 1, 1)
- u(F 0 R 1) = (0, 0, 0, 0.5, 0.5)
- u(R 0 R 1) = (1, 1, 1, 1, 0)

Linear PSR from POMDP
Given a POMDP representation, how can we pick tests that form a sufficient statistic? Search:
- Start with Q = {}.
- While there is some t ∈ Q (or the null test) such that some one-step extension a o t is independent of Q, add a o t to Q.
- Otherwise terminate and return Q.
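A sketch of this search, using the outcome vectors u(t) from the POMDP and a rank test for linear independence; the function and variable names are hypothetical, and the breadth-first frontier is just one way to organize the extensions:

```python
import numpy as np

def find_core_tests(T, O, actions, observations, n_states):
    """Return a set Q of tests with linearly independent outcome vectors u(t).

    T[a] : n x n transition matrix; O[o] : n x n diagonal observation matrix.
    A test is a tuple of (action, observation) pairs; u(a o t) = T^a O^o u(t),
    with u(null test) = the all-ones vector.
    """
    def outcome(test):
        u = np.ones(n_states)
        for a, o in reversed(test):
            u = T[a] @ O[o] @ u
        return u

    Q, U = [], []            # core tests and their outcome vectors
    frontier = [()]          # start by extending the null test
    while frontier:
        t = frontier.pop(0)
        for a in actions:
            for o in observations:
                cand = ((a, o),) + t
                u = outcome(cand)
                # Keep the extension if u is independent of u(Q).
                if np.linalg.matrix_rank(np.array(U + [u])) > len(U):
                    Q.append(cand)
                    U.append(u)
                    frontier.append(cand)
    return Q
```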

Properties of search
- search terminates: Q contains no more than |S| tests, and no test in Q is longer than |S|.
- All tests are dependent on Q.
- M_{ao}^T = u(Q)^+ T^a O^o u(Q) (the u(Q) terms cancel in the update).
- Runs in polynomial time and captures the POMDP.

Float/Reset Updates
[Figure: how each test in Q updates on an F 0 step; one prediction is written as the linear combination 0.25 R 0 - 0.0625 F 0 R 0 + 0.750 F 0 F 0 R 0.]
Interestingly, the two tests R 1 and F 0 R 1 form a (non-linear) PSR on their own.
Prediction sequence: 1, .5, .5, .375, .375, .3125, .3125, …

Planning in PSRs
V(p) = max_a Σ_o Pr(o | a, p) γ (r(o) + V'(p'))
Let V'(p) = max_α p α^T (piecewise linear and convex).
Recall p' = p M_{ao}^T / p m_{ao}^T (linear PSR).
So V(p) = max_a max_α p (Σ_o γ (m_{ao}^T r(o) + M_{ao}^T α^T)) = max_β p β^T
Incremental Pruning works; so do sampling and function approximation.

Learning in a Linear PSR
- Need to learn m_{aot} for each test t in Q.
- Use the current estimates to produce a sequence of prediction vectors p from the data stream, e.g. … F 0 F 0 R 0 R 1 F 0 R 1 F 0 F 0 …
  - m_{R1 F0 R0} gets target 0 (importance-sampling weighted)
  - m_{R1 F0} gets target 1 (importance-sampling weighted)
  - m_{R1 R0} gets no target from this data
- Simple via a linear update rule (delta rule): gradient-like, but no improvement theorem is known.
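One way to realize the delta rule mentioned above is the sketch below; the slide does not give the exact update, so the learning rate, the importance-sampling weight, and the names are assumptions:

```python
import numpy as np

def delta_rule_step(m, p, target, lr=0.1, is_weight=1.0):
    """Nudge m so that p · m moves toward the observed test outcome.

    m        : weight vector m_{aot} being learned (NumPy array, updated in place)
    p        : prediction vector p(h) at the time the extension test began
    target   : 1 if the test succeeded in the data, 0 if it failed
    is_weight: importance-sampling weight correcting for the action choices
    """
    error = target - float(p @ m)     # prediction error for this test
    m += lr * is_weight * error * p   # gradient-like correction
    return m
```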

Contributions
- Formulate important problems as POMDPs; plan via exact and approximate methods; solutions factor in the cost of information.
- Predictions are useful as state:
  - Rivest & Schapire (1987): deterministic environments
  - Jaeger (1999): stochastic, no actions
  - this work: stochastic environments with actions
- As expressive as POMDPs with belief states, even with linear predictions.

Ongoing Work: PSRs
- Learning PSRs (with Singh, Stone, Sutton): Does it work? Can TD help? How to choose tests?
- Connection between tests and options
- Abstraction via dropping tests
- Key tests useful for state and reward

Learning in Float-Reset
[Figure: learning results on the Float/Reset problem]

Value Iteration for POMDPs
Find a set of vectors (linear functions) to represent the value function:
for t = 1 to k: find a minimum covering C_t via the vectors in C_{t-1}
- One-pass (Sondik 71): search constrained regions
- Exhaustive (Monahan 82; Lark 91): enumerate vectors
- Linear support (Cheng 88): enumerate vertices
- Witness (Littman, Cassandra & Kaelbling 95): cover by actions
- Incremental Pruning (Cassandra, Littman & Zhang 97): enumerate, with periodic removal of redundant vectors