Markov Decision Processes: Value Iteration
Pieter Abbeel, UC Berkeley EECS

Markov Decision Process
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]
Assumption: the agent gets to observe the state.

Markov Decision Process (S, A, T, R, H)
Given:
- S: set of states
- A: set of actions
- T: S × A × S × {0, 1, …, H} → [0, 1], the transition model: T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
- R: S × A × S × {0, 1, …, H} → ℝ, the reward model: R_t(s, a, s') = reward for the transition (s_t = s, a_t = a, s_{t+1} = s')
- H: horizon over which the agent will act
Goal: find a policy π: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e., max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ] (a minimal data-structure sketch follows below).
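To make the (S, A, T, R, H) tuple concrete, here is a minimal Python sketch of one possible representation of a small finite-horizon MDP. The three-state chain, the variable names, and the time-independent transition model (T_t = T for all t) are illustrative assumptions, not part of the original slides.

    # A toy finite-horizon MDP as plain Python data structures.
    # The state/action names and transition numbers below are made up for illustration.

    states = ["s0", "s1", "s2"]
    actions = ["left", "right"]
    horizon = 3  # H

    # transition[(s, a)] maps each successor s' to P(s_{t+1} = s' | s_t = s, a_t = a);
    # each row sums to 1. The model is the same for every t here, for brevity.
    transition = {
        ("s0", "right"): {"s1": 0.8, "s0": 0.2},
        ("s0", "left"):  {"s0": 1.0},
        ("s1", "right"): {"s2": 0.8, "s1": 0.2},
        ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
        ("s2", "right"): {"s2": 1.0},
        ("s2", "left"):  {"s1": 0.8, "s2": 0.2},
    }

    # reward[(s, a, s')] is the reward for that transition; anything unlisted is 0.
    reward = {("s1", "right", "s2"): 1.0}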

Examples
MDP (S, A, T, R, H), goal: max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Models of animals and people

Canonical Example: Grid World
- The agent lives in a grid.
- Walls block the agent's path.
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there).
  - 10% of the time, North takes the agent West; 10% of the time, East.
  - If there is a wall in the direction the agent would have been taken, the agent stays put.
- Big rewards come at the end.
(A code sketch of this noisy action model follows below.)
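The snippet below sketches the noisy action model just described: the intended direction succeeds with probability 0.8, the two perpendicular directions each occur with probability 0.1, and a move into a wall or off the grid leaves the agent where it is. The grid representation and helper names are assumptions for illustration, not taken from the slides.

    NOISE = {  # intended action -> [(actual move, probability), ...]
        "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
        "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
        "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
        "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)],
    }
    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}  # (row, col) deltas

    def grid_transition(state, action, walls, n_rows, n_cols):
        """Return {next_state: probability} for taking `action` in `state`, a (row, col) pair."""
        probs = {}
        for move, p in NOISE[action]:
            dr, dc = MOVES[move]
            nxt = (state[0] + dr, state[1] + dc)
            if nxt in walls or not (0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols):
                nxt = state  # blocked by a wall or the boundary: stay put
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs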

Grid Futures
[Figure: action outcomes in a deterministic grid world (each action N, S, E, W leads to a single successor state) vs. a stochastic grid world (each action leads to a distribution over possible successor states).]

Solving MDPs
- In an MDP, we want an optimal policy π*: S × {0, 1, …, H} → A.
- A policy π gives an action for each state, for each time step.
- An optimal policy maximizes the expected sum of rewards.
- Contrast: in a deterministic setting, we want an optimal plan, or sequence of actions, from start to a goal.
[Figure: a policy illustrated over time steps t = 0, 1, …, 5 = H.]

Value Iteration
Idea: V_i*(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.
Algorithm:
- Start with V_0*(s) = 0 for all s.
- For i = 1, …, H: given V_{i-1}*, calculate for all states s ∈ S:
  V_i*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V_{i-1}*(s') ]
This is called a value update or Bellman update/back-up. (A code sketch follows below.)
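Here is a minimal Python sketch of the finite-horizon value update above. It assumes the dictionary-based MDP layout from the earlier sketch (transition[(s, a)] -> {s': prob}, reward[(s, a, s')] -> float); the function name and the default handling of missing entries are my choices, not the slides'.

    def value_iteration(states, actions, transition, reward, horizon):
        """Return [V_0, V_1, ..., V_H], where V[i][s] is the expected sum of rewards
        when starting in s and acting optimally for i steps."""
        V = [{s: 0.0 for s in states}]  # V_0*(s) = 0 for all s
        for i in range(1, horizon + 1):
            V_prev, V_i = V[-1], {}
            for s in states:
                best = float("-inf")
                for a in actions:
                    probs = transition.get((s, a), {s: 1.0})  # missing entries: stay put
                    q = sum(p * (reward.get((s, a, s2), 0.0) + V_prev[s2])
                            for s2, p in probs.items())
                    best = max(best, q)  # Bellman back-up: max over actions
                V_i[s] = best
            V.append(V_i)
        return V

On the toy chain MDP sketched earlier, V_1 would assign value 0.8 to s1 (the rewarding transition succeeds with probability 0.8) and 0 elsewhere, and longer horizons would propagate that value back to s0.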

Example

Example: Value Iteration
Information propagates outward from terminal states, and eventually all states have correct value estimates.
[Figure: grid world value estimates V_2 and V_3.]

Practice: Computing Actions
Which action should we choose from state s, given the optimal values V*?
  π*(s) = arg max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*(s') ]
This is the greedy action with respect to V*, i.e., the action choice with one-step lookahead w.r.t. V*. (A code sketch follows below.)
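A sketch of the one-step lookahead in code, reusing the dictionary-based MDP layout assumed in the earlier snippets (the function name and the handling of missing transition entries are assumptions for illustration):

    def greedy_action(s, V_star, actions, transition, reward):
        """Return the action maximizing sum_{s'} T(s, a, s') * (R(s, a, s') + V*(s'))."""
        def lookahead(a):
            probs = transition.get((s, a), {s: 1.0})  # missing entries: stay put
            return sum(p * (reward.get((s, a, s2), 0.0) + V_star[s2])
                       for s2, p in probs.items())
        return max(actions, key=lookahead)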

Today and Forthcoming Lectures
Optimal control provides a general computational approach to tackling control problems.
- Dynamic programming / value iteration
  - Discrete state spaces (done!)
  - Discretization of continuous state spaces
  - Linear systems: LQR
  - Extensions to nonlinear settings:
    - Local linearization
    - Differential dynamic programming
- Optimal control through nonlinear optimization
  - Open-loop
  - Model predictive control
- Examples