Markov Decision Processes CSE 573

Add concrete MDP example. No need to discuss STRIPS or factored models. Matrix ok until much later. Example key (e.g. mini robot).

Logistics
  Reading: AIMA Ch. 21 (Reinforcement Learning)
  Project 1 due today
    2 printouts of report to Miao, with source code
    Document in .doc or .pdf
  Project 2 description on web
    New teams: by Monday 11/15, Miao w/ team + direction
    Feel free to consider other ideas

Idea 1: Spam Filter
  Decision tree learner? Ensemble of...? Naïve Bayes?
  Bag-of-words representation
  Enhancement: augment data set?

Idea 2: Localization
  Placelab data
  Learn "places": K-means clustering
  Predict movements between places: Markov model, or ...?

Proto-idea 3: Captchas
  The problem of software robots
  Turing test is big business
  Break or create?
  Non-vision based?

Proto-idea 4: Openmind.org
  Repository of knowledge in NLP
  What the heck can we do with it?

Openmind Animals

Proto-idea 5: Wordnet
  Giant graph of concepts
  Centrally controlled semantics
  What to do? Integrate with FAQ lists, Openmind, ...?

573 Topics
  Agency
  Problem Spaces
  Search
  Knowledge Representation & Inference
  Planning
  Supervised Learning
  Logic-Based / Probabilistic
  Reinforcement Learning

Where are We?
  Uncertainty
  Bayesian Networks
  Sequential Stochastic Processes
  (Hidden) Markov Models
  Dynamic Bayesian Networks (DBNs)
  Probabilistic STRIPS Representation
  Markov Decision Processes (MDPs)
  Reinforcement Learning

An Example Bayes Net
[Figure: network with nodes Earthquake, Burglary, Radio, Alarm, Nbr1Calls, Nbr2Calls, together with the CPTs Pr(B) and Pr(A|E,B); the CPT entries 0.9, 0.2, 0.85, 0.01 are shown but the e/¬e, b/¬b row labels were lost in extraction.]

Planning
[Figure: agent/environment loop (percepts in, actions out, "What action next?"), annotated with this setting: static, fully observable, stochastic, instantaneous actions, full and perfect percepts, i.e. planning under uncertainty.]

Models of Planning
  Uncertainty:      Deterministic   Disjunctive   Probabilistic
  Complete obs.     Classical       Contingent    MDP
  Partial obs.      ???             Contingent    POMDP
  No obs.           ???             Conformant    POMDP

Recap: Markov Models
  Q: set of states
  Initial probability distribution
  A: transition probability distribution (ONE per ACTION)
  Markov assumption
  Stationary model assumption

A Factored Domain
  Variables: has_user_coffee (huc), has_robot_coffee (hrc), robot_is_wet (w), has_robot_umbrella (u), raining (r), robot_in_office (o)
  Actions: buy_coffee, deliver_coffee, get_umbrella, move
  What is the number of states?
  Can we succinctly represent transition probabilities in this case?
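A quick count (my own arithmetic, assuming all six variables are Boolean, which matches the "36 vs 4096" comparison on the DBN slide below):

```latex
% assuming all six state variables are Boolean
|S| = 2^{6} = 64, \qquad
\text{flat transition table for one action: } |S| \times |S| = 64 \times 64 = 4096 \text{ entries}
```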

Probabilistic "STRIPS"?
[Figure: probabilistic STRIPS-style schema for Move: office → cafe, conditioned on inOffice, Raining, and hasUmbrella; outcomes include -inOffice and +Wet, with one branch labeled P < .1.]

Dynamic Bayesian Nets
[Figure: two-slice DBN over huc, hrc, w, u, r, o.]
Total values required to represent the transition probability table = 36 vs. 4096.

Dynamic Bayesian Net for Move
[Figure: two-slice DBN over huc, hrc, w, u, r, o and their primed copies, with CPTs Pr(r') and Pr(w'|u,w); the u/¬u, w/¬w row labels were lost in extraction.]
Actually the table should have 16 entries!

Actions in DBN
[Figure: two-slice DBN (time T and T+1) over huc, hrc, w, u, r, o with an explicit action node a.]
Last time: actions in the DBN, unrolling. Don't need them today.

Observability
  Full Observability
  Partial Observability
  No Observability

Reward/Cost
  Each action has an associated cost.
  Agent may accrue rewards at different stages. A reward may depend on:
    the current state
    the (current state, action) pair
    the (current state, action, next state) triplet
  Additivity assumption: costs and rewards are additive.
  Reward accumulated = R(s0) + R(s1) + R(s2) + ...

Horizon
  Finite: plan for t stages. Reward = R(s0) + R(s1) + ... + R(st)
  Infinite: the agent never dies.
    The reward R(s0) + R(s1) + R(s2) + ... could be unbounded.
    Discounted reward: R(s0) + γR(s1) + γ²R(s2) + ...
    Average reward: lim n→∞ (1/n) Σi R(si) ?
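One reason discounting is convenient (a small derivation added here for completeness, assuming |R(s)| ≤ R_max and 0 ≤ γ < 1): the discounted sum is always bounded.

```latex
\Bigl|\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\Bigr|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
```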

Goal for an MDP
  Find a policy which maximizes expected discounted reward over an infinite horizon, for a fully observable Markov decision process.
  Why shouldn't the planner just find a plan?
  What is a policy?

Optimal Value of a State
  Define V*(s), the `value of a state', as the maximum expected discounted reward achievable from this state.
  Value of the state if we force the agent to take action a right now, but let it act optimally afterwards:
    Q*(a,s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
  V* should satisfy the following equation:
    V*(s) = max_{a∈A} Q*(a,s)
          = R(s) + max_{a∈A} [ c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s') ]

Value Iteration
  Assign an arbitrary value to each state (or use an admissible heuristic).
  Iterate over the set of states, improving the value function in each iteration:
    V_{t+1}(s) = R(s) + max_{a∈A} [ c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V_t(s') ]     (`Bellman backup')
  Stop the iteration appropriately. V_t approaches V* as t increases.
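A minimal sketch of this loop in Python. The toy two-state MDP, the array layout, and the function name are illustrative choices, not the course's code; costs are simply added as in the backup above, so a real cost would be entered as a negative number.

```python
import numpy as np

def value_iteration(R, C, P, gamma=0.9, epsilon=1e-6):
    """Bellman backups until epsilon-convergence.

    V(s) = R(s) + max_a [ C[a] + gamma * sum_s' P[a, s, s'] * V(s') ]

    R: (S,) state rewards; C: (A,) per-action costs (negative if they hurt);
    P: (A, S, S) transition probabilities, P[a, s, s'] = Pr(s' | s, a)."""
    V = np.zeros(P.shape[1])                      # arbitrary initial values
    while True:
        Q = C[:, None] + gamma * (P @ V)          # Q[a, s] without the R(s) term
        V_new = R + Q.max(axis=0)                 # Bellman backup at every state
        if np.max(np.abs(V_new - V)) < epsilon:   # max residue below epsilon
            return V_new
        V = V_new

# Tiny made-up 2-state, 2-action MDP, just to exercise the function.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],          # action 0
              [[0.5, 0.5], [0.3, 0.7]]])         # action 1
R = np.array([0.0, 1.0])
C = np.array([-0.1, -0.2])
V_star = value_iteration(R, C, P)
```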

Max Bellman Backup
[Figure: backup diagram at state s; each action a1, a2, a3 leads to successor states with values V_n, the Q_{n+1}(s,a) values are computed from them, and V_{n+1}(s) takes the max over actions.]

Stopping Condition
  ε-convergence: a value function is ε-optimal if the error (residue) at every state is less than ε.
    Residue(s) = |V_{t+1}(s) - V_t(s)|
    Stop when max_{s∈S} Residue(s) < ε

Complexity of Value Iteration
  One iteration takes O(|S|² |A|) time.
  Number of iterations required: poly(|S|, |A|, 1/(1-γ)).
  Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.
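To make those bounds concrete with the coffee-robot domain above (64 states, 4 actions; these numbers are just the earlier count plugged in):

```latex
|S|^{2}\,|A| = 64^{2} \times 4 \approx 1.6 \times 10^{4} \text{ per iteration,}
\quad\text{but with 12 Boolean variables: } (2^{12})^{2} \times 4 \approx 6.7 \times 10^{7}
```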

Computation of the Optimal Policy
  Given the value function V*(s), do one Bellman backup at each state; the action that maximizes the bracketed term is the optimal action.
  The optimal policy is stationary (time-independent) – intuitive for the infinite-horizon case.
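Continuing the value-iteration sketch above (same illustrative arrays), policy extraction is a single batch of backups followed by an argmax:

```python
def extract_policy(V, C, P, gamma=0.9):
    """Greedy policy w.r.t. V; R(s) is the same for every action,
    so it does not affect the argmax and is omitted here."""
    Q = C[:, None] + gamma * (P @ V)   # Q[a, s]
    return Q.argmax(axis=0)            # one action index per state

pi_star = extract_policy(V_star, C, P)
```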

Policy Evaluation
  Given a policy Π: S → A, find the value of each state under this policy:
    V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s'|Π(s),s) V^Π(s')
  This is a system of linear equations in |S| variables.
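A sketch of exact policy evaluation with NumPy, rearranging the equation above into (I - γ P_Π) V = R + c_Π and solving it directly (same illustrative array conventions as before):

```python
import numpy as np

def evaluate_policy(pi, R, C, P, gamma=0.9):
    """pi: (S,) action index per state. Solves the |S| linear equations exactly."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]   # P_pi[s, s'] = Pr(s' | s, pi(s))
    c_pi = C[pi]                           # cost of the chosen action in each state
    A = np.eye(n_states) - gamma * P_pi
    return np.linalg.solve(A, R + c_pi)
```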

Bellman's Principle of Optimality
  A policy Π is optimal if V^Π(s) ≥ V^Π'(s) for all policies Π' and all states s ∈ S.
  Rather than finding the optimal value function, we can try to find the optimal policy directly, by searching policy space.

Policy Iteration
  Start with any policy Π_0.
  Iterate:
    Policy evaluation: for each state, find V^{Π_i}(s).
    Policy improvement: for each state s, find the action a* that maximizes Q^{Π_i}(a,s);
      if Q^{Π_i}(a*,s) > V^{Π_i}(s), let Π_{i+1}(s) = a*, else let Π_{i+1}(s) = Π_i(s).
  Stop when Π_{i+1} = Π_i.
  Converges faster than value iteration, but the policy evaluation step is more expensive.
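Putting the two steps together, a compact sketch of the loop (it reuses the illustrative evaluate_policy helper from the previous slide; the small tolerance is my own addition to keep the floating-point comparison honest):

```python
def policy_iteration(R, C, P, gamma=0.9, tol=1e-9):
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        V = evaluate_policy(pi, R, C, P, gamma)        # policy evaluation
        Q = R[None, :] + C[:, None] + gamma * (P @ V)  # Q[a, s] under V^pi
        better = Q.max(axis=0) > V + tol               # strictly better action exists?
        pi_new = np.where(better, Q.argmax(axis=0), pi)
        if np.array_equal(pi_new, pi):                 # no state improved: done
            return pi, V
        pi = pi_new
```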

Modified Policy Iteration
  Rather than computing the exact value of the policy by solving a system of equations, approximate it by running value iteration with the policy held fixed.
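A sketch of that idea: replace the exact solve with a fixed number of backups under the frozen policy (the array conventions match the earlier sketches, and k is a tunable knob I picked arbitrarily):

```python
def approx_evaluate_policy(pi, R, C, P, gamma=0.9, k=20, V0=None):
    """k sweeps of V <- R + c_pi + gamma * P_pi V, with the policy held fixed."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]
    c_pi = C[pi]
    V = np.zeros(n_states) if V0 is None else V0.copy()
    for _ in range(k):
        V = R + c_pi + gamma * P_pi @ V
    return V
```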

RTDP Iteration
  Start with the initial belief and initialize the value of each belief to its heuristic value.
  For the current belief:
    Save the action that minimizes the current value into the current policy.
    Update the value of the belief with a Bellman backup.
    Apply that minimizing action, then randomly pick an observation and move to the resulting next belief.
  Repeat until the goal is achieved.
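For intuition, here is a sketch of the same trial-based loop in the fully observable (state-space) setting with costs to minimize; the slide's version runs the identical loop over beliefs, sampling an observation to choose the next belief. The function names and callback interfaces (succ, cost, heuristic, is_goal) are illustrative assumptions, and the sketch assumes the goal is reachable.

```python
import random

def rtdp(s0, actions, succ, cost, heuristic, is_goal, n_trials=100):
    """succ(s, a) -> list of (prob, next_state); cost(s, a) -> float.
    Values start at the heuristic and are updated by Bellman backups
    along simulated trials."""
    V = {}

    def value(s):
        if s not in V:
            V[s] = heuristic(s)        # lazy initialization from the heuristic
        return V[s]

    def q(s, a):
        return cost(s, a) + sum(p * value(s2) for p, s2 in succ(s, a))

    for _ in range(n_trials):
        s = s0
        while not is_goal(s):
            a = min(actions(s), key=lambda act: q(s, act))  # greedy, cost-minimizing action
            V[s] = q(s, a)                                  # Bellman backup at s
            probs, nexts = zip(*succ(s, a))                 # sample the next state
            s = random.choices(nexts, weights=probs, k=1)[0]
    return V
```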

Fast RTDP Convergence
  What are the advantages of RTDP?
  What are the disadvantages of RTDP?
  How can we speed up RTDP?

Other Speedups
  Heuristics
  Aggregations
  Reachability analysis

Going Beyond Full Observability
  During the execution phase, we are uncertain where we are, but we have some idea of where we can be.
  A belief state = ?

Models of Planning
  Uncertainty:      Deterministic   Disjunctive   Probabilistic
  Complete obs.     Classical       Contingent    MDP
  Partial obs.      ???             Contingent    POMDP
  No obs.           ???             Conformant    POMDP

Speedups
  Reachability analysis
  More informed heuristics

Mathematical Modelling
  Search space: finite/infinite, state space or belief space (a belief state = some idea of where we are).
  Initial state/belief.
  Actions.
  Action transitions (state to state / belief to belief).
  Action costs.
  Feedback: zero / partial / total.

Algorithms for Search
  A*: works for sequential solutions.
  AO*: works for acyclic solutions.
  LAO*: works for cyclic solutions.
  RTDP: works for cyclic solutions.

Full Observability
  Modelled as MDPs (also called fully observable MDPs).
  Output: a policy (state → action).
  Bellman equation:  V*(s) = max_{a∈A(s)} [ c(a) + Σ_{s'∈S} P(s'|s,a) V*(s') ]

Partial Observability
  Modelled as POMDPs (partially observable MDPs); also called probabilistic contingent planning.
  Belief = probability distribution over states. What is the size of the belief space?
  Output: a policy (discretized belief → action).
  Bellman equation:  V*(b) = max_{a∈A(b)} [ c(a) + Σ_{o∈O} P(b,a,o) V*(b_a^o) ]
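The b_a^o in that equation is the belief after taking a and observing o. A minimal sketch of that update (the array shapes for T and O are my assumptions; the P(b,a,o) from the slide falls out as the normalizer):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b: (S,) current belief; T[a, s, s'] = Pr(s'|s, a); O[a, s', o] = Pr(o|s', a).
    Returns (b_a^o, Pr(o | b, a)); assumes the observation has nonzero probability."""
    predicted = b @ T[a]                 # Pr(s' | b, a)
    unnormalized = predicted * O[a, :, o]
    prob_o = unnormalized.sum()          # this is P(b, a, o) in the Bellman equation
    return unnormalized / prob_o, prob_o
```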

No Observability
  Deterministic search in the belief space.
  Output?