MDP Reinforcement Learning

Markov Decision Process
[Title slide diagram: a small chain of states connected by the questions "Should you give money to charity?" and "Would you contribute?", ending in a final state where money ($) is collected.]

Charity MDP
- State space: 3 states
- Actions: "Should you give money to charity?", "Would you contribute?"
- Observations: knowledge of the current state
- Rewards: in the final state, a positive reward proportional to the amount of money gathered
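A minimal Python sketch of one way this MDP could be represented (the state names, the exact transition structure, and the reward value of 10 are illustrative assumptions, since the slides describe the MDP only informally):

```python
# Toy representation of the charity MDP: 3 states, 2 actions.
STATES = ["start", "asked", "done"]              # "done" is the final ($) state
ACTIONS = ["should_give", "would_contribute"]    # the two questions

def f(state, action):
    """Deterministic next-state function f(j, a). Assumed structure:
    "Should you give money to charity?" leads to an intermediate state,
    while "Would you contribute?" goes straight to the final state."""
    if state == "done":
        return "done"                            # the final state is absorbing
    if state == "start" and action == "should_give":
        return "asked"
    return "done"

def R(state):
    """Reward R(j): positive only in the final state (placeholder amount)."""
    return 10.0 if state == "done" else 0.0
```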

- So how can we raise the most money (maximize the reward)?
- I.e., what is the best policy?
  - Policy: the optimal action for each state

Lecture Outline
1. Computing the Value Function
2. Finding the Optimal Policy
3. Computing the Value Function in an Online Environment

Useful Definitions
Define:
- π: a policy
- π(j): the action to take in state j
- R(j): the reward received in state j
- f(j, π): the next state, starting from state j and performing action π(j)

Computing the Value Function
- When the rewards are known, we can compute the value function for a particular policy π
- V^π(j), the value function: the expected reward for being in state j and following policy π
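One way to write this out explicitly, using the definitions above (this formula is an added clarification, not from the slide): V^π(j) = Σ_{t ≥ 0} γ^t · R(j_t), where j_0 = j, j_{t+1} = f(j_t, π(j_t)), and γ is the discount rate introduced on the next slide.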

Calculating V^π(j)
1. Set V^π_0(j) = 0, for all j
2. For i = 1 to Max_i:
   V^π_i(j) = R(j) + γ · V^π_{i-1}(f(j, π(j)))
γ is the discount rate; it measures how much future rewards can propagate to previous states.
The formula above depends on the rewards being known.
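A minimal Python sketch of this iteration, reusing the toy charity MDP defined earlier (the GAMMA value and the example policies are assumptions for illustration):

```python
GAMMA = 0.5  # discount rate; the next slide fixes it at 0.5

def evaluate_policy(policy, max_i, states, f, R, gamma=GAMMA):
    """Iteratively compute V^pi via
    V^pi_i(j) = R(j) + gamma * V^pi_{i-1}(f(j, pi(j)))."""
    V = {j: 0.0 for j in states}                 # step 1: V^pi_0(j) = 0 for all j
    for _ in range(max_i):                       # step 2: i = 1 .. Max_i
        V = {j: R(j) + gamma * V[f(j, policy[j])] for j in states}
    return V

# Example: a policy that asks both questions vs. one that cuts to the chase.
ask_both = {"start": "should_give", "asked": "would_contribute", "done": "would_contribute"}
cut_to_chase = {"start": "would_contribute", "asked": "would_contribute", "done": "would_contribute"}
print(evaluate_policy(ask_both, 3, STATES, f, R))      # V^pi_3 for each state
print(evaluate_policy(cut_to_chase, 3, STATES, f, R))
```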

Value Fn for the Charity MDP
Fix γ at 0.5 and consider two policies: one which asks both questions, and one which cuts to the chase.
What is V^π_3 if:
1. The reward is constant at the final state (everyone gives the same amount of money)?
2. The reward is 10 times higher if you ask whether one should give money to charity?

Given the value function, how can we find the policy which maximizes the rewards?

Policy Iteration
1. Set π_0 to be an arbitrary policy
2. Set i to 0
3. Compute V^π_i(j) for all states j
4. Compute π_{i+1}(j) = argmax_a V^π_i(f(j, a))
5. If π_{i+1} = π_i, stop; otherwise i++ and go back to step 3
What would this do for the charity MDP in the two cases above?
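A Python sketch of these steps, reusing evaluate_policy and the toy MDP from the earlier sketches (the number of evaluation sweeps per iteration is an assumed detail):

```python
def policy_iteration(states, actions, f, R, gamma=0.5, eval_sweeps=20):
    policy = {j: actions[0] for j in states}             # step 1: arbitrary initial policy
    while True:
        V = evaluate_policy(policy, eval_sweeps, states, f, R, gamma)  # step 3: V^pi_i(j)
        new_policy = {j: max(actions, key=lambda a: V[f(j, a)])        # step 4: argmax_a V^pi_i(f(j, a))
                      for j in states}
        if new_policy == policy:                         # step 5: stop once the policy is stable
            return policy, V
        policy = new_policy                              # otherwise i++ and repeat from step 3

best_policy, V = policy_iteration(STATES, ACTIONS, f, R)  # run on the toy charity MDP
print(best_policy)
```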

Lecture Outline
1. Computing the Value Function
2. Finding the Optimal Policy
3. Computing the Value Function in an Online Environment

MDP Learning
- So, when the rewards are known, we can calculate the optimal policy using policy iteration.
- But what happens in the case where we don't know the rewards?

Lecture Outline
1. Computing the Value Function
2. Finding the Optimal Policy
3. Computing the Value Function in an Online Environment

Deterministic vs. Stochastic Update
Deterministic: V^π_i(j) = R(j) + γ · V^π_{i-1}(f(j, π(j)))
Stochastic: V^π(n) = (1 - α) · V^π(n) + α · [r + γ · V^π(n')]
(Here α is a step-size, or averaging rate, and γ is the discount rate as before.)
The difference is that the stochastic version averages over all visits to the state.

MDP Extensions
- Probabilistic state transitions
- How should you calculate the value function for the first state now?
[Slide diagram: asking "Would you like to contribute?" now leads probabilistically to either a Happy state or a Mad state.]
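With probabilistic transitions, the single successor f(j, π(j)) in the deterministic update is replaced by an expectation over possible successors. One standard way to write this (added here for clarity, not taken from the slide): V^π(j) = R(j) + γ · Σ_{j'} P(j' | j, π(j)) · V^π(j'), where P(j' | j, a) is the probability of reaching state j' by taking action a in state j.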

Probabilistic Transitions
- The online computation strategy works the same even when the state transitions are probabilistic
- It also works in the case where you don't know what the transitions are

Online V^π(j) Computation
1. For each j, initialize V^π(j) = 0
2. Set n = initial state
3. Set r = reward in state n
4. Let n' = f(n, π(n))
5. V^π(n) = (1 - α) · V^π(n) + α · [r + γ · V^π(n')]
6. n = n', and go back to step 3
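A Python sketch of this online procedure (sample_next and reward are hypothetical environment callbacks, the policy is a state-to-action dict as in the earlier sketches, and the α, γ, and step-count values are assumptions):

```python
def online_value_estimate(policy, sample_next, reward, start_state,
                          alpha=0.1, gamma=0.5, n_steps=10_000):
    """Online estimate of V^pi using the stochastic update
    V(n) <- (1 - alpha) * V(n) + alpha * (r + gamma * V(n')).
    sample_next(n, a) draws the next state, so the transition
    probabilities themselves never need to be known."""
    V = {}                                       # step 1: V^pi(j) defaults to 0
    n = start_state                              # step 2
    for _ in range(n_steps):
        r = reward(n)                            # step 3: reward in the current state
        n_next = sample_next(n, policy[n])       # step 4: sample n' from f(n, pi(n))
        V[n] = (1 - alpha) * V.get(n, 0.0) + alpha * (r + gamma * V.get(n_next, 0.0))  # step 5
        n = n_next                               # step 6: move on and repeat
    return V
```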

1-step Q-learning
1. Initialize Q(n, a) arbitrarily
2. Select π as the policy
3. n = initial state, r = reward in state n, a = π(n)
4. Q(n, a) = (1 - α) · Q(n, a) + α · [r + γ · max_a' Q(n', a')]
5. n = n', and go back to step 3
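A matching Python sketch of 1-step Q-learning (again, sample_next and reward are hypothetical environment callbacks; the behaviour policy is passed in as a dict, following the slide's step 2):

```python
def q_learning(policy, actions, sample_next, reward, start_state,
               alpha=0.1, gamma=0.5, n_steps=10_000):
    """1-step Q-learning with the update
    Q(n, a) <- (1 - alpha) * Q(n, a) + alpha * (r + gamma * max_a' Q(n', a'))."""
    Q = {}                                       # step 1: Q(n, a) defaults to 0 (arbitrary init)
    n = start_state                              # step 3: start state
    for _ in range(n_steps):
        r = reward(n)                            # reward in the current state
        a = policy[n]                            # action chosen by the behaviour policy pi
        n_next = sample_next(n, a)               # observe the next state n'
        best_next = max(Q.get((n_next, a2), 0.0) for a2 in actions)
        Q[(n, a)] = (1 - alpha) * Q.get((n, a), 0.0) + alpha * (r + gamma * best_next)  # step 4
        n = n_next                               # step 5: continue from n'
    return Q
```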