Reinforcement Learning (II.) Exercise Solutions Ata Kaban School of Computer Science University of Birmingham 2003.

Slides:



Advertisements
Similar presentations
Reinforcement Learning
Advertisements

RL - Worksheet -worked exercise- Ata Kaban School of Computer Science University of Birmingham.
Lirong Xia Reinforcement Learning (2) Tue, March 21, 2014.
Markov Decision Processes (MDPs) read Ch utility-based agents –goals encoded in utility function U(s), or U:S  effects of actions encoded in.
Markov Decision Process
Worksheet I. Exercise Solutions Ata Kaban School of Computer Science University of Birmingham.
Value Iteration & Q-learning CS 5368 Song Cui. Outline Recap Value Iteration Q-learning.
RL for Large State Spaces: Value Function Approximation
SA-1 Probabilistic Robotics Planning and Control: Partially Observable Markov Decision Processes.
CSE-573 Artificial Intelligence Partially-Observable MDPS (POMDPs)
1. Algorithms for Inverse Reinforcement Learning 2
MDP Presentation CS594 Automated Optimal Decision Making Sohail M Yousof Advanced Artificial Intelligence.
1 Reinforcement Learning Introduction & Passive Learning Alan Fern * Based in part on slides by Daniel Weld.
Markov Decision Processes
Planning under Uncertainty
SA-1 1 Probabilistic Robotics Planning and Control: Markov Decision Processes.
Reinforcement Learning
Reinforcement Learning (2) Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)
Algorithms For Inverse Reinforcement Learning Presented by Alp Sardağ.
Department of Computer Science Undergraduate Events More
Reinforcement Learning (1)
1 Quality of Experience Control Strategies for Scalable Video Processing Wim Verhaegh, Clemens Wüst, Reinder J. Bril, Christian Hentschel, Liesbeth Steffens.
CS Reinforcement Learning1 Reinforcement Learning Variation on Supervised Learning Exact target outputs are not given Some variation of reward is.
MDP Reinforcement Learning. Markov Decision Process “Should you give money to charity?” “Would you contribute?” “Should you give money to charity?” $
Utility Theory & MDPs Tamara Berg CS Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart.
Instructor: Vincent Conitzer
Reinforcement Learning
CSE-473 Artificial Intelligence Partially-Observable MDPS (POMDPs)
Reinforcement Learning (II.) Exercise Solutions Ata Kaban School of Computer Science University of Birmingham.
1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 7: Finite Horizon MDPs, Dynamic Programming Dr. Itamar Arel College of Engineering.
CSE-573 Reinforcement Learning POMDPs. Planning What action next? PerceptsActions Environment Static vs. Dynamic Fully vs. Partially Observable Perfect.
Reinforcement Learning
Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 1: Generalized policy iteration.
Reinforcement Learning Ata Kaban School of Computer Science University of Birmingham.
Department of Computer Science Undergraduate Events More
MDPs (cont) & Reinforcement Learning
Decision Making Under Uncertainty CMSC 471 – Spring 2041 Class #25– Tuesday, April 29 R&N, material from Lise Getoor, Jean-Claude Latombe, and.
CPS 570: Artificial Intelligence Markov decision processes, POMDPs
Announcements  Upcoming due dates  Wednesday 11/4, 11:59pm Homework 8  Friday 10/30, 5pm Project 3  Watch out for Daylight Savings and UTC.
CS 484 – Artificial Intelligence1 Announcements Homework 5 due Tuesday, October 30 Book Review due Tuesday, October 30 Lab 3 due Thursday, November 1.
Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 31 January, 2012.
Department of Computer Science Undergraduate Events More
CHAPTER 11 R EINFORCEMENT L EARNING VIA T EMPORAL D IFFERENCES Organization of chapter in ISSO –Introduction –Delayed reinforcement –Basic temporal difference.
Markov Decision Process (MDP)
Possible actions: up, down, right, left Rewards: – 0.04 if non-terminal state Environment is observable (i.e., agent knows where it is) MDP = “Markov Decision.
Reinforcement Learning. Overview Supervised Learning: Immediate feedback (labels provided for every input). Unsupervised Learning: No feedback (no labels.
REINFORCEMENT LEARNING Unsupervised learning 1. 2 So far ….  Supervised machine learning: given a set of annotated istances and a set of categories,
Reinforcement Learning
Making complex decisions
NATCOR Stochastic Modelling Course Inventory Control – Part 2
Markov Decision Processes
Announcements Homework 3 due today (grace period through Friday)
CMSC 471 Fall 2009 RL using Dynamic Programming
Chapter 4: Dynamic Programming
Chapter 4: Dynamic Programming
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Dr. Unnikrishnan P.C. Professor, EEE
CS 188: Artificial Intelligence Fall 2007
Reinforcement Learning in MDPs by Lease-Square Policy Iteration
13. Acting under Uncertainty Wolfram Burgard and Bernhard Nebel
Instructor: Vincent Conitzer
Chapter 4: Dynamic Programming
Reinforcement Learning Dealing with Partial Observability
CS 416 Artificial Intelligence
Reinforcement Nisheeth 18th January 2019.
Reinforcement Learning (2)
Markov Decision Processes
Markov Decision Processes
Reinforcement Learning (2)
CHAPTER 11 REINFORCEMENT LEARNING VIA TEMPORAL DIFFERENCES
Presentation transcript:

Reinforcement Learning (II.) Exercise Solutions Ata Kaban School of Computer Science University of Birmingham 2003

Exercise 1 In the grid based environment below the state values have all been computed except for one. Possible actions are up, down, left and right. All other actions result in no reward except those that move the agent out of states A and B. Calculate the value of the blank state assuming a random policy (the action is selected randomly between those possible). Consider a discount reward  = 0.9.

Solution V  (5) = 0.25( 0 +  V  (1)) ( 0 +  V  (2)) ( 0 +  V  (3)) ( 0 +  V  (4)) V  (5) = 0.25(0.9)[ ] = 2.25

Exercise 2 The diagram below depicts an MDP model of a fierce battle.

You can move between two locations, L1 and L2, one of them being closer to the adversary. If you attack from the closest state, –then you have more chances (90%) to succeed (while only 70% from the farther location), –however you could also be detected (with 80% chance) and killed (while the chances of being detected from the farther location is 50%). You can only be detected if you stay in the same location. You need to come up with an action plan for the situation.

The arrows represent the possible actions: – ‘move’ (M) is a deterministic action –‘attack’ (A) and ‘stay’ (S) are stochastic. For the stochastic actions, the probabilities of transitioning to the next state are indicated on the arrow. All rewards are 0, except in the terminal states, where your success is represented by a reward of +50 while your adversary’s success is a reward of -50 for you. Employing a discount factor of 0.9, compute an optimal policy (action plan).

Solution The computations of action-values for all states and actions are required. Denote by In value iteration, we start with initial estimates (for all other states) Then we update all action values according to the update rule: where

Hence, in the first iteration of the algorithm we get: The values for the ‘move’ action stay the same (at 0): After this iteration, the values of the two states are and they correspond to the action of ‘attacking’ in both states.

The next iteration gives the following: The new V-values are (by computing max): These correspond to the ‘attack’ action in both states.

This process can continue until the values do not change much between successive iterations. From what we can see at this point, the best action plan seems to be attacking all the time. Can we say more without a full computer simulation?

Continuing (optional)… It is clear that to ‘Stay’ is suboptimal in both states. In the Close state, it is also clear that the best thing to do is to ‘Attack’ continuously (given that we have no cost for that). Actually we can compute the values in the limit analytically (if you keep an eye at changes in update from iteration to iteration)

Now for the far state, the question is between ‘Attack’ or ‘Move’ to the closer orbit. Compute the values for both these actions (in the same way as before):

Hence it is worth moving closer to the orbit. The optimal policy for this problem setting (!) is to move closer and attack from there. Can you imagine a different policy making more sense for this problem? Can you imagine another setting (parameter design) which would lead to a different (more desirable) optimal policy? Designing the parameter setting for a situation according to the conditions is up to the human and not up to the machine… Well in this exercise all parameters were given but in your potential future real applications will be not.