Eick: Q-Learning for the PD-World, COSC 6342 Project 1, Spring 2014
Q-Learning for a Pickup/Dropoff World
[Slide figure: 5x5 grid with pickup cells marked P and dropoff cells marked D]

PD-World
[Slide figure: 5x5 grid of cells (1,1) through (5,5) with pickup and dropoff cells marked]
Initial State: the agent is in cell (1,5) and each pickup cell contains 5 blocks.
Terminal State: each dropoff cell contains 5 blocks.
Pickup cells: (1,1), (3,3), (5,5)
Dropoff cells: (5,1), (5,3), (4,5)
Goal: transport the blocks from the pickup cells to the dropoff cells!

Operators of the PD-World
There are six operators:
- North, South, East, West are applicable in every state and move the agent to the adjacent cell in that direction; moves that would take the agent off the grid are not allowed.
- Pickup is only applicable if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block.
- Dropoff is only applicable if the agent is in a dropoff cell that contains fewer than 5 blocks and the agent carries a block.
Initial state of the PD-World: each pickup cell contains 5 blocks, each dropoff cell contains 0 blocks, and the agent always starts in position (1,5).
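These applicability rules can be captured in a few lines of code. The following is a minimal sketch only; the field names agent_pos, carrying, and blocks are illustrative assumptions, as is the convention that (i,j) means (row, column) with row 1 at the top, so that north decreases i.

PICKUP_CELLS = [(1, 1), (3, 3), (5, 5)]
DROPOFF_CELLS = [(5, 1), (5, 3), (4, 5)]

def applicable_operators(state):
    """Return the operators applicable in `state` (assumed dict layout)."""
    i, j = state["agent_pos"]                 # assumed: 1-based (row, column)
    ops = []
    if i > 1: ops.append("north")             # moves that would leave the 5x5 grid are excluded
    if i < 5: ops.append("south")
    if j > 1: ops.append("west")
    if j < 5: ops.append("east")
    # pickup: agent on a pickup cell with at least one block, not already carrying
    if not state["carrying"] and (i, j) in PICKUP_CELLS and state["blocks"][(i, j)] > 0:
        ops.append("pickup")
    # dropoff: agent carries a block and is on a dropoff cell holding fewer than 5 blocks
    if state["carrying"] and (i, j) in DROPOFF_CELLS and state["blocks"][(i, j)] < 5:
        ops.append("dropoff")
    return ops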

State Space
The actual state space of the PD-World is as follows: states have the form (i, j, x, a, b, c, d, e, f) where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block and 0 if not
- (a,b,c,d,e,f) are the number of blocks in cells (1,1), (3,3), (5,5), (5,1), (5,3), and (4,5), respectively.
Initial State: (1,5,0,5,5,5,0,0,0)
Terminal States: (*,*,0,0,0,0,5,5,5)
Remark: The reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single reinforcement learning state.
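As a small illustration, the initial and terminal states of this full state space can be encoded as 9-tuples. This is only a sketch; it assumes the start position (1,5) stated on the previous slides.

def initial_state():
    # (i, j, x, a, b, c, d, e, f): agent at (1,5), not carrying (x=0),
    # pickup cells (1,1),(3,3),(5,5) each hold 5 blocks, dropoff cells hold 0
    return (1, 5, 0, 5, 5, 5, 0, 0, 0)

def is_terminal(state):
    # terminal: agent carries nothing, pickup cells are empty, every dropoff cell holds 5 blocks
    i, j, x, a, b, c, d, e, f = state
    return x == 0 and (a, b, c) == (0, 0, 0) and (d, e, f) == (5, 5, 5)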

Rewards in the PD-World
- Picking up a block from a pickup cell: +12
- Dropping off a block in a dropoff cell: +12
- Applying north, south, east, or west: -1
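A minimal reward-function sketch mirroring these values (operator names follow the applicability sketch above):

def reward(operator):
    # +12 for a successful pickup or dropoff, -1 for any move
    return 12 if operator in ("pickup", "dropoff") else -1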

Project 1 Policies
- PRandom: If pickup or dropoff is applicable, choose that operator; otherwise, choose an applicable operator randomly.
- PExploit1: If pickup or dropoff is applicable, choose that operator; otherwise, with probability 0.6 apply the applicable operator with the highest q-value (breaking ties randomly among operators with the same q-value) and with probability 0.4 choose an applicable operator randomly.
- PExploit2: If pickup or dropoff is applicable, choose that operator; otherwise, with probability 0.85 apply the applicable operator with the highest q-value (breaking ties randomly) and with probability 0.15 choose an applicable operator randomly.
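The three policies differ only in the probability of acting greedily once pickup/dropoff has been ruled out. The sketch below is one possible reading, not the required implementation; it assumes the applicable_operators helper from the operator sketch, a full world state for applicability checks, a hashable RL state rl_state for indexing the Q-table, and a Q-table stored as a dictionary q keyed by (rl_state, operator) with a default value of 0.

import random

GREEDY_PROB = {"PRandom": 0.0, "PExploit1": 0.6, "PExploit2": 0.85}

def choose_operator(world_state, rl_state, q, policy):
    ops = applicable_operators(world_state)    # assumed helper, see earlier sketch
    if "pickup" in ops:                        # pickup and dropoff are never both applicable
        return "pickup"
    if "dropoff" in ops:
        return "dropoff"
    if random.random() < GREEDY_PROB[policy]:
        best = max(q.get((rl_state, op), 0.0) for op in ops)
        # break ties among operators with the maximal q-value at random
        return random.choice([op for op in ops if q.get((rl_state, op), 0.0) == best])
    return random.choice(ops)                  # exploratory choice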

Performance Measures
a. Bank account of the agent
b. Rewards received divided by the number of operators applied, computed over the whole time window or over the last 40 operator applications
c. Blocks delivered divided by the number of operators applied, computed over the whole time window or over the last 40 operator applications
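Measures (b) and (c) are ratios that can be tracked incrementally. The sketch below is illustrative only; the class name and the fixed window of 40 operator applications are assumptions matching the slide.

from collections import deque

class PerformanceTracker:
    """Tracks measures (b) and (c): reward and deliveries per operator application."""
    def __init__(self, window=40):
        self.total_reward = 0.0
        self.total_delivered = 0
        self.num_ops = 0
        self.recent = deque(maxlen=window)     # (reward, delivered) for the last `window` operators

    def record(self, r, delivered):
        self.total_reward += r
        self.total_delivered += int(delivered)
        self.num_ops += 1
        self.recent.append((r, int(delivered)))

    def overall_rates(self):
        # (reward per operator, blocks delivered per operator) over the whole run
        return self.total_reward / self.num_ops, self.total_delivered / self.num_ops

    def recent_rates(self):
        # the same ratios over the last `window` operator applications
        n = len(self.recent)
        return (sum(r for r, _ in self.recent) / n,
                sum(d for _, d in self.recent) / n)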

Reinforcement Learning Search Space 1
Reinforcement learning states have the form (i,j,x,s,t,u) where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block; otherwise, 0
- s, t, u are Boolean variables whose meaning depends on whether the agent carries a block:
  - Case 1: x=0 (agent does not carry a block): s is 1 if cell (1,1) contains at least one block; t is 1 if cell (3,3) contains at least one block; u is 1 if cell (5,5) contains at least one block.
  - Case 2: x=1 (agent carries a block): s is 1 if cell (5,1) contains fewer than 5 blocks; t is 1 if cell (5,3) contains fewer than 5 blocks; u is 1 if cell (4,5) contains fewer than 5 blocks.
There are 400 states in total in reinforcement learning state space 1 (25 positions × 2 values of x × 8 combinations of s, t, u).
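The mapping from a full world state (the 9-tuple of the earlier state-space slide) to an RL state in space 1 might look as follows; this is a sketch under those assumptions, not the required implementation.

def rl_state1(world):
    # `world` is the full 9-tuple (i, j, x, a, b, c, d, e, f) from the state-space slide
    i, j, x, a, b, c, d, e, f = world
    if x == 0:
        # not carrying: s, t, u flag whether pickup cells (1,1), (3,3), (5,5) still hold a block
        s, t, u = int(a > 0), int(b > 0), int(c > 0)
    else:
        # carrying: s, t, u flag whether dropoff cells (5,1), (5,3), (4,5) can still accept a block
        s, t, u = int(d < 5), int(e < 5), int(f < 5)
    return (i, j, x, s, t, u)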

Alternative Reinforcement Learning Search Space 2
In this approach reinforcement learning states have the form (i,j,x) where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block; otherwise, 0.
That is, RL space 2 has only 50 states.
Discussion:
1. The problem with state space 2 is that the algorithm initially learns paths between pickup cells and dropoff cells, but the q-values decrease as soon as a pickup cell runs out of blocks or a dropoff cell is full and can no longer receive blocks, because it is no longer attractive to visit these cells. These paths therefore have to be relearned when an agent is restarted to solve the same problem again using the final Q-table of the previous run. This problem does not exist when RL state space 1 is used.
2. On the other hand, when using the recommended search space (state space 1), if one of the variables s, t, u switches from 1 to 0, all paths need to be relearned.

Analysis of Attractive Paths … Demo:

TD Q-Learning for the PD-World
Goal: Measure the utility of applying action a in state s, denoted Q(a,s); the following update formula is applied every time the agent reaches state s' from s by applying action a:

Q(a,s) ← (1-α)·Q(a,s) + α·[R(s',a,s) + γ·max_a' Q(a',s')]

where α is the learning rate and γ is the discount factor; R(s',a,s) is the reward for reaching s' from s by applying a, e.g. -1 for moving and +12 for picking up or dropping off a block in the PD-World.
Remark: This is the Q-learning approach you must use in Project 1!
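A direct transcription of this update rule into code could look as follows. It is a sketch, assuming the same dictionary-based Q-table as in the policy sketch; applicable_next is the set of operators applicable in s'.

def q_update(q, s, a, r, s_next, applicable_next, alpha, gamma):
    """One TD Q-learning update: Q(a,s) <- (1-alpha)*Q(a,s) + alpha*(r + gamma*max_a' Q(a',s'))."""
    best_next = max((q.get((s_next, a2), 0.0) for a2 in applicable_next), default=0.0)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best_next)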