Heuristic Search for Problems with Uncertainty
CSE 574, April 22, 2003
Mausam

Mathematical modelling
- Search space: finite or infinite; state space or belief space.
- Belief state = some idea of where we are, represented as a set of states or a probability distribution over states.
- Initial state/belief; goal set of states/beliefs.
- Actions and their applicability: an action is applicable in a belief if it is applicable in all constituent states.
- Action transitions (state to state / belief to belief).
- Action costs.
- Uncertainty: disjunctive or probabilistic.
- Feedback: zero, partial, or total.
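A minimal sketch of this model as Python data structures (the class name UncertainPlanningProblem and its callable fields are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Set

State = Hashable
Belief = FrozenSet[State]  # disjunctive belief: the set of states we might be in

@dataclass
class UncertainPlanningProblem:
    states: Set[State]
    actions: Set[str]
    applicable: Callable[[str, State], bool]        # is action a applicable in state s?
    successors: Callable[[str, State], Set[State]]  # disjunctive outcomes of a in s
    cost: Callable[[str], float]                    # action cost c(a)
    initial_belief: Belief
    goal_states: Set[State]

    def applicable_in_belief(self, a: str, b: Belief) -> bool:
        # An action is applicable in a belief iff it is applicable in all constituent states.
        return all(self.applicable(a, s) for s in b)

    def belief_successor(self, a: str, b: Belief) -> Belief:
        # Progress a disjunctive belief through an action (zero-feedback case).
        return frozenset(s2 for s in b for s2 in self.successors(a, s))
```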

Disjunctive - Zero
Known as conformant planning.
- Belief = set of states (finite).
- Output: a plan (sequence of actions) that takes all possible initial states to some goal state, or failure if no such plan exists.
- Modelled as deterministic search in the belief space.
- What is the number of belief states if the number of state variables is N?
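With N Boolean state variables there are 2^N states, hence on the order of 2^(2^N) belief states. As an illustration, here is a hedged sketch of conformant planning as breadth-first search in belief space, reusing the illustrative UncertainPlanningProblem defined above:

```python
from collections import deque
from typing import List, Optional

def conformant_plan(p: "UncertainPlanningProblem") -> Optional[List[str]]:
    """Breadth-first search in belief space: return an action sequence that takes
    every state in the initial belief to the goal set, or None on failure."""
    start = p.initial_belief
    if start.issubset(p.goal_states):
        return []
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        belief, plan = frontier.popleft()
        for a in p.actions:
            if not p.applicable_in_belief(a, belief):
                continue
            nxt = p.belief_successor(a, belief)
            if nxt in visited:
                continue
            if nxt.issubset(p.goal_states):
                return plan + [a]
            visited.add(nxt)
            frontier.append((nxt, plan + [a]))
    return None  # failure: no conformant plan exists
```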

Disjunctive - Partial
Known as contingent planning.
- Belief = set of states (finite).
- Output: policy (Belief -> Action).
- Bellman equation: V*(b) = min_{a ∈ A(b)} [ c(a) + max_{o ∈ O(b)} V*(b_a^o) ], where V*(b) is the worst-case cost of reaching the goal from belief b and b_a^o is the belief obtained by doing a in b and observing o.

Disjunctive - Total
- Output: policy (State -> Action).
- Bellman equation: V*(s) = min_{a ∈ A(s)} [ c(a) + max_{s' ∈ F(a,s)} V*(s') ], where F(a,s) is the set of possible successor states of a in s.
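A hedged sketch of the dynamic programming suggested by the worst-case (min-max) backups on this and the previous slide, here at the state level; the successor function, goal test, and convergence loop are illustrative assumptions:

```python
from typing import Callable, Dict, Hashable, Iterable, Set

State = Hashable

def minimax_value_iteration(
    states: Iterable[State],
    actions: Callable[[State], Iterable[str]],        # A(s)
    successors: Callable[[str, State], Set[State]],   # F(a, s)
    cost: Callable[[str], float],                     # c(a)
    is_goal: Callable[[State], bool],
    eps: float = 1e-6,
) -> Dict[State, float]:
    """Iterate the worst-case backup
    V(s) <- min_a [ c(a) + max_{s' in F(a,s)} V(s') ], with V(goal) = 0,
    starting from the zero value function, until the largest residual is below eps.
    Assumes every non-goal state has an applicable action and can reach the goal."""
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in V:
            if is_goal(s):
                continue
            backup = min(
                cost(a) + max(V[s2] for s2 in successors(a, s))
                for a in actions(s)
            )
            residual = max(residual, abs(V[s] - backup))
            V[s] = backup
        if residual < eps:
            return V
```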

Probabilistic - Zero
Deterministic search in the belief space.

Probabilistic - Total
Modelled as MDPs (also called fully observable MDPs).
- Output: policy (State -> Action).
- Bellman equation: V*(s) = min_{a ∈ A(s)} [ c(a) + Σ_{s' ∈ S} P(s'|s,a) V*(s') ].
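A hedged sketch of value iteration with this expected-cost backup; the transition, cost, and goal-test hooks are illustrative assumptions, and the loop presumes every non-goal state has an applicable action and can reach the goal:

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

State = Hashable

def mdp_value_iteration(
    states: Iterable[State],
    actions: Callable[[State], Iterable[str]],               # A(s)
    transition: Callable[[State, str], Dict[State, float]],  # P(s'|s,a) as a dict over s'
    cost: Callable[[str], float],                            # c(a)
    is_goal: Callable[[State], bool],
    eps: float = 1e-6,
) -> Tuple[Dict[State, float], Dict[State, str]]:
    """Repeat V(s) <- min_a [ c(a) + sum_{s'} P(s'|s,a) V(s') ], V(goal) = 0,
    until the largest residual drops below eps; also record the greedy policy."""
    V = {s: 0.0 for s in states}
    policy: Dict[State, str] = {}
    while True:
        residual = 0.0
        for s in V:
            if is_goal(s):
                continue
            best_a, best_q = None, float("inf")
            for a in actions(s):
                q = cost(a) + sum(p * V[s2] for s2, p in transition(s, a).items())
                if q < best_q:
                    best_a, best_q = a, q
            residual = max(residual, abs(V[s] - best_q))
            V[s], policy[s] = best_q, best_a
        if residual < eps:
            return V, policy
```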

Probabilistic - Partial
Modelled as POMDPs (partially observable MDPs); also called probabilistic contingent planning.
- Belief = probability distribution over states. What is the size of the belief space?
- Output: policy (Discretized Belief -> Action).
- Bellman equation: V*(b) = min_{a ∈ A(b)} [ c(a) + Σ_{o ∈ O} P(o|b,a) V*(b_a^o) ], where b_a^o is the belief obtained from b by doing a and observing o.
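The belief space here is the (|S|-1)-dimensional probability simplex, which is why the output policy is defined over a discretized belief space. Below is a hedged sketch of the belief update b_a^o and of a single Bellman backup over beliefs; the transition and observation models are illustrative assumptions:

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

State = Hashable
ProbBelief = Dict[State, float]  # probability distribution over states

def belief_update(
    b: ProbBelief,
    a: str,
    o: str,
    transition: Callable[[State, str], Dict[State, float]],   # P(s'|s,a)
    observation: Callable[[State, str], Dict[str, float]],    # P(o|s',a)
) -> Tuple[ProbBelief, float]:
    """Return (b_a^o, P(o|b,a)): the posterior belief after doing a and observing o,
    together with the probability of that observation."""
    unnorm: ProbBelief = {}
    for s, p_s in b.items():
        for s2, p_t in transition(s, a).items():
            unnorm[s2] = unnorm.get(s2, 0.0) + p_s * p_t * observation(s2, a).get(o, 0.0)
    p_o = sum(unnorm.values())
    if p_o == 0.0:
        return {}, 0.0
    return {s2: p / p_o for s2, p in unnorm.items()}, p_o

def belief_bellman_backup(
    b: ProbBelief,
    actions: Iterable[str],
    observations: Iterable[str],
    transition: Callable[[State, str], Dict[State, float]],
    observation: Callable[[State, str], Dict[str, float]],
    cost: Callable[[str], float],
    value: Callable[[ProbBelief], float],   # current value estimate over beliefs
) -> float:
    """One backup of V(b) = min_a [ c(a) + sum_o P(o|b,a) V(b_a^o) ]."""
    best = float("inf")
    for a in actions:
        q = cost(a)
        for o in observations:
            b2, p_o = belief_update(b, a, o, transition, observation)
            if p_o > 0.0:
                q += p_o * value(b2)
        best = min(best, q)
    return best
```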

Heuristic Search
- Heuristic computation.
- Search using the heuristics.
- Stopping condition.
- Speedups.

Heuristic computation
- Solve a relaxed problem (gives an admissible heuristic).
- Example: assume full observability and solve the relaxed (disjunctive-total) problem for V*_rel(s); then h(b) = max_{s ∈ b} V*_rel(s). Similarly for probabilistic problems.
- The disjunctive-total relaxation can be solved in time polynomial in |S| by dynamic programming with initial value function 0.
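A hedged one-liner for this heuristic, assuming v_rel was computed by something like the minimax_value_iteration sketch above:

```python
from typing import Dict, FrozenSet, Hashable

State = Hashable

def admissible_heuristic(belief: FrozenSet[State], v_rel: Dict[State, float]) -> float:
    """h(b) = max_{s in b} V*_rel(s), where v_rel solves the fully observable
    (disjunctive-total) relaxation of the problem."""
    return max(v_rel[s] for s in belief)
```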

Algorithms for search
- A*: works for sequential solutions.
- AO*: works for acyclic solutions (Nilsson 80).
- LAO*: works for cyclic solutions (Hansen and Zilberstein 01).
- RTDP: works for cyclic solutions (Barto et al 95).

RTDP iteration
Start with the initial belief and initialize the value of each newly encountered belief to its heuristic value. For the current belief:
- Save the action that minimises the current value in the current policy.
- Update the value of the belief through a Bellman backup.
- Apply the minimizing action and randomly pick an observation.
- Move to the successor belief corresponding to that observation.
Repeat until a goal belief is reached.
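A hedged sketch of a single RTDP trial over a generic hashable belief representation; the actions, outcomes, cost, goal-test, and heuristic hooks are illustrative assumptions:

```python
import random
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Belief = Hashable  # any hashable belief representation, e.g. a frozenset of (state, prob) pairs

def rtdp_trial(
    b0: Belief,
    actions: Callable[[Belief], Iterable[str]],                     # A(b)
    outcomes: Callable[[Belief, str], List[Tuple[float, Belief]]],  # [(P(o|b,a), b_a^o), ...]
    cost: Callable[[str], float],                                   # c(a)
    is_goal: Callable[[Belief], bool],
    heuristic: Callable[[Belief], float],                           # admissible h(b)
    V: Dict[Belief, float],                                         # value table, updated in place
    policy: Dict[Belief, str],                                      # greedy policy, updated in place
    max_steps: int = 10_000,
) -> None:
    """One RTDP trial: greedy action selection, a Bellman backup at each visited
    belief, and stochastic simulation of an observation outcome."""
    def value(b: Belief) -> float:
        return V.setdefault(b, heuristic(b))  # lazy initialization to the heuristic value

    b = b0
    for _ in range(max_steps):
        if is_goal(b):
            return
        # Greedy action and Bellman backup: Q(b,a) = c(a) + sum_o P(o|b,a) V(b_a^o)
        best_a, best_q = None, float("inf")
        for a in actions(b):
            q = cost(a) + sum(p * value(b2) for p, b2 in outcomes(b, a))
            if q < best_q:
                best_a, best_q = a, q
        V[b], policy[b] = best_q, best_a
        # Simulate: sample the next belief according to the observation probabilities.
        probs, succs = zip(*outcomes(b, best_a))
        b = random.choices(succs, weights=probs, k=1)[0]
```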

Stopping Condition
ε-convergence: a value function is ε-optimal if the error (residual) at every state/belief is less than ε.
Example (MDPs): R(s) = | V(s) - min_{a ∈ A(s)} [ c(a) + Σ_{s' ∈ S} P(s'|s,a) V(s') ] |.
Stop when max_{s ∈ S} R(s) < ε.
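A hedged sketch of this stopping test for the MDP case; the transition, cost, and goal-test hooks are illustrative assumptions:

```python
from typing import Callable, Dict, Hashable, Iterable

State = Hashable

def max_bellman_residual(
    V: Dict[State, float],
    actions: Callable[[State], Iterable[str]],
    transition: Callable[[State, str], Dict[State, float]],  # P(s'|s,a)
    cost: Callable[[str], float],
    is_goal: Callable[[State], bool],
) -> float:
    """max_s | V(s) - min_a [ c(a) + sum_{s'} P(s'|s,a) V(s') ] | over non-goal states."""
    worst = 0.0
    for s in V:
        if is_goal(s):
            continue
        backup = min(
            cost(a) + sum(p * V.get(s2, 0.0) for s2, p in transition(s, a).items())
            for a in actions(s)
        )
        worst = max(worst, abs(V[s] - backup))
    return worst

# Usage: stop iterating once max_bellman_residual(...) < eps.
```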

Experiments
- Tested on larger but tame POMDPs.
- Comparison with other planners?
- Table vs. graph.
- Time to compute the heuristics.
- Converted contingent planning to probabilistic contingent planning with uniform probabilities!

Weaknesses
- Experiments are not informative.
- Scaling to bigger problems remains open.

Easy extensions (Geffner 02)
- More general cost functions.
- More general goal models, e.g. Pr(reaching the goal) > threshold.
- Extension to epistemic goals/preconditions: an epistemic goal is a goal about the agent having complete knowledge of some Boolean.

Speedups
- Reachability analysis (LAO*).
- Cheaper heuristics.
- More informed heuristics (Bertoli and Cimatti 02).

Fast RTDP convergence
- What are the advantages of RTDP?
- What are the disadvantages of RTDP?
- How to speed up RTDP? (Bonet and Geffner 03)

Future Work
- Discretizations: finer in more probable regions; dynamic; several runs focussing on different areas; or don't discretize at all.
- Which heuristic to use? Heuristics for information-gathering problems.
- Heuristic search in factored state spaces?