Eick: Q-Learning for the PD-World. COSC 6342 Project 1, Spring 2014. Q-Learning for a Pickup Dropoff World.


1 Eick: Q-Learning for the PD-World. COSC 6342 Project 1, Spring 2014. Q-Learning for a Pickup Dropoff World

2 PD-World
Grid: 5x5, with cells labeled (1,1) through (5,5).
Pickup Cells: (1,1), (3,3), (5,5)
Dropoff Cells: (5,1), (5,3), (4,5)
Initial State: The agent is in cell (1,5) and each pickup cell contains 5 blocks.
Terminal State: Each dropoff cell contains 5 blocks.
Goal: Transport blocks from the pickup cells to the dropoff cells!

3 PD-World Operators
There are six operators:
- North, South, East, West are applicable in each state and move the agent to the adjacent cell in that direction, except that a move that would leave the grid is not allowed.
- Pickup is only applicable if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block.
- Dropoff is only applicable if the agent is in a dropoff cell that contains fewer than 5 blocks and the agent carries a block.
Initial state of the PD-World: Each pickup cell contains 5 blocks and each dropoff cell contains 0 blocks; the agent always starts in position (1,5).
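The applicability rules above can be sketched in Python as follows. This is only an illustration: the function name `applicable_operators`, the `blocks` dictionary, and the convention that (i,j) means (row, column) with north decreasing the row are assumptions, not part of the slides.

```python
# Assumption: cells are (row, column), 1-based, and "north" decreases the row.
GRID_SIZE = 5
PICKUP_CELLS = [(1, 1), (3, 3), (5, 5)]
DROPOFF_CELLS = [(5, 1), (5, 3), (4, 5)]
CAPACITY = 5  # a dropoff cell is full once it holds 5 blocks

MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}


def applicable_operators(pos, carrying, blocks):
    """Return the operators applicable at `pos`.

    `blocks` maps every pickup and dropoff cell to its current block count.
    """
    ops = []
    for name, (di, dj) in MOVES.items():
        i, j = pos[0] + di, pos[1] + dj
        # Moves that would leave the grid are not allowed.
        if 1 <= i <= GRID_SIZE and 1 <= j <= GRID_SIZE:
            ops.append(name)
    if pos in PICKUP_CELLS and not carrying and blocks[pos] >= 1:
        ops.append("pickup")
    if pos in DROPOFF_CELLS and carrying and blocks[pos] < CAPACITY:
        ops.append("dropoff")
    return ops


# Block counts in the initial state: 5 per pickup cell, 0 per dropoff cell.
initial_blocks = {cell: 5 for cell in PICKUP_CELLS}
initial_blocks.update({cell: 0 for cell in DROPOFF_CELLS})
```

Under these assumptions, the agent in its start cell (1,5) can only move south or west.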

4 State Space of the PD-World
The actual state space of the PD-World is as follows: (i, j, x, a, b, c, d, e, f), where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block and 0 if not
- (a,b,c,d,e,f) are the numbers of blocks in cells (1,1), (3,3), (5,5), (5,1), (5,3), and (4,5), respectively
Initial State: (1,5,0,5,5,5,0,0,0)
Terminal State: (*,*,0,0,0,0,5,5,5)
Remark: The actual reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single state of the reinforcement learning state space.
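One possible way to represent this nine-component state in Python is a named tuple. The names `PDState`, `INITIAL_STATE`, and `is_terminal` are illustrative assumptions; the initial agent position (1,5) is taken from the earlier slides, which state that the agent always starts in cell (1,5).

```python
from typing import NamedTuple


class PDState(NamedTuple):
    i: int  # agent row
    j: int  # agent column
    x: int  # 1 if the agent carries a block, else 0
    a: int  # blocks in pickup cell (1,1)
    b: int  # blocks in pickup cell (3,3)
    c: int  # blocks in pickup cell (5,5)
    d: int  # blocks in dropoff cell (5,1)
    e: int  # blocks in dropoff cell (5,3)
    f: int  # blocks in dropoff cell (4,5)


INITIAL_STATE = PDState(1, 5, 0, 5, 5, 5, 0, 0, 0)


def is_terminal(state):
    """True for states matching (*,*,0,0,0,0,5,5,5): any agent position."""
    return (
        state.x == 0
        and (state.a, state.b, state.c) == (0, 0, 0)
        and (state.d, state.e, state.f) == (5, 5, 5)
    )
```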

5 Rewards in the PD-World
Rewards:
- Picking up a block from a pickup state: +12
- Dropping off a block in a dropoff state: +12
- Applying north, south, east, west: -1
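As a small worked example of these rewards, the sketch below sums the reward of one round trip. The names `REWARDS` and `total_reward` are hypothetical, and the four-step path assumes that four south moves lead from pickup cell (1,1) to dropoff cell (5,1).

```python
# Reward per operator, as listed on the slide.
REWARDS = {
    "north": -1, "south": -1, "east": -1, "west": -1,
    "pickup": 12, "dropoff": 12,
}


def total_reward(operators):
    """Sum the rewards of a sequence of applied operators."""
    return sum(REWARDS[op] for op in operators)


# Pick up at (1,1), move four cells, drop off at (5,1).
trip = ["pickup", "south", "south", "south", "south", "dropoff"]
print(total_reward(trip))  # 12 - 4 + 12 = 20
```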

6 Project1 Policies
- PRandom: If pickup or dropoff is applicable, choose this operator; otherwise, choose an applicable operator randomly.
- PExploit1: If pickup or dropoff is applicable, choose this operator; otherwise, apply the applicable operator with the highest q-value (break ties by rolling a die for operators with the same utility) with probability 0.6, and choose an applicable operator randomly with probability 0.4.
- PExploit2: If pickup or dropoff is applicable, choose this operator; otherwise, apply the applicable operator with the highest q-value (break ties by rolling a die…) with probability 0.85, and choose an applicable operator randomly with probability 0.15.
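The three policies differ only in how often they exploit the Q-table, so they can be sketched with one function and a single probability parameter. The function name `choose_operator`, its arguments, and the default Q-value of 0.0 for unseen operators are assumptions made for this illustration.

```python
import random


def choose_operator(applicable, q_values, exploit_prob, rng=random):
    """Pick an operator for the current state.

    applicable   -- list of applicable operator names
    q_values     -- dict mapping operator name to its Q-value in this state
    exploit_prob -- 0.0 for PRandom, 0.6 for PExploit1, 0.85 for PExploit2
    """
    # All three policies take pickup/dropoff whenever it is applicable.
    for op in ("pickup", "dropoff"):
        if op in applicable:
            return op
    if rng.random() < exploit_prob:
        # Exploit: highest Q-value, ties broken uniformly at random.
        best = max(q_values.get(op, 0.0) for op in applicable)
        ties = [op for op in applicable if q_values.get(op, 0.0) == best]
        return rng.choice(ties)
    # Explore: any applicable operator, uniformly at random.
    return rng.choice(applicable)
```

For example, `choose_operator(ops, q, 0.6)` follows PExploit1, while `choose_operator(ops, q, 0.0)` never consults the Q-values and so behaves like PRandom.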

7 Performance Measures
a. Bank account of the agent
b. Rewards received over the number of operators applied, over the whole time window or the last 40 operator applications
c. Blocks delivered over the number of operators applied, over the whole time window or the last 40 operator applications
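A minimal sketch of how these three measures might be tracked is given below. It reads "over the number of operators applied" as a ratio, and it counts a block as delivered whenever the dropoff operator is applied; both readings, as well as the class name `PerformanceTracker`, are assumptions.

```python
from collections import deque


class PerformanceTracker:
    def __init__(self, window=40):
        self.bank_account = 0       # measure (a): sum of all rewards so far
        self.operators_applied = 0
        self.blocks_delivered = 0
        # Keeps only the last `window` (reward, delivered) pairs.
        self.recent = deque(maxlen=window)

    def record(self, operator, reward):
        delivered = 1 if operator == "dropoff" else 0
        self.bank_account += reward
        self.operators_applied += 1
        self.blocks_delivered += delivered
        self.recent.append((reward, delivered))

    def reward_rate(self, windowed=False):
        """Measure (b): reward per operator applied."""
        if windowed:
            n, total = len(self.recent), sum(r for r, _ in self.recent)
        else:
            n, total = self.operators_applied, self.bank_account
        return total / n if n else 0.0

    def delivery_rate(self, windowed=False):
        """Measure (c): blocks delivered per operator applied."""
        if windowed:
            n, total = len(self.recent), sum(d for _, d in self.recent)
        else:
            n, total = self.operators_applied, self.blocks_delivered
        return total / n if n else 0.0
```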

8 Reinforcement Learning Search Space1
Reinforcement learning states have the form (i,j,x,s,t,u), where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block; otherwise, 0
- s, t, u are boolean variables whose meaning depends on whether the agent carries a block or not
Case 1: x=0 (agent does not carry a block)
- s is 1 if cell (1,1) contains at least one block
- t is 1 if cell (3,3) contains at least one block
- u is 1 if cell (5,5) contains at least one block
Case 2: x=1 (agent does carry a block)
- s is 1 if cell (5,1) contains fewer than 5 blocks
- t is 1 if cell (5,3) contains fewer than 5 blocks
- u is 1 if cell (4,5) contains fewer than 5 blocks
There are 25 × 2 × 8 = 400 states in total in reinforcement learning state space1.
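The aggregation from the actual state (i,j,x,a,b,c,d,e,f) to a state of space1 can be sketched as a small mapping function. The name `to_rl_state` is hypothetical; the component order follows the state-space slide.

```python
def to_rl_state(state):
    """Map an actual PD-World state to an (i, j, x, s, t, u) state of space1."""
    i, j, x, a, b, c, d, e, f = state
    if x == 0:
        # Not carrying: which pickup cells still contain a block?
        s, t, u = int(a >= 1), int(b >= 1), int(c >= 1)
    else:
        # Carrying: which dropoff cells can still accept a block?
        s, t, u = int(d < 5), int(e < 5), int(f < 5)
    return (i, j, x, s, t, u)


# 25 positions x 2 carrying values x 2^3 flag combinations.
NUM_RL_STATES = 5 * 5 * 2 * 2 ** 3
print(NUM_RL_STATES)  # 400
```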

9 Alternative Reinforcement Learning Search Space2
In this approach, reinforcement learning states have the form (i,j,x), where
- (i,j) is the position of the agent
- x is 1 if the agent carries a block; otherwise, 0
That is, in RL state space2 there are only 25 × 2 = 50 states.
Discussion:
1. The problem with state space2 is that the algorithm initially learns paths between pickup states and dropoff states, but the q-values will decrease as soon as a pickup state runs out of blocks or a dropoff state is full and cannot receive any further blocks, as it is no longer attractive to visit these states. Therefore, these paths have to be relearned when an agent is restarted to solve the same problem again using the final Q-table of the previous run. This problem does not exist when RL state space1 is used.
2. On the other hand, when using the recommended search space (state space1), if one of the variables s, t, u switches from 1 to 0, all paths need to be relearned.

10 Analysis of Attractive Paths
…
Demo: http://www2.hawaii.edu/~chenx/ics699rl/grid/gridworld.html

11 TD-Q-Learning for the PD-World
Goal: Measure the utility of using action a in state s, denoted by Q(a,s); the following update formula is used every time an agent reaches state s' from s using action a:
$$Q(a,s) \leftarrow (1-\alpha)\, Q(a,s) + \alpha \left[ R(s',a,s) + \gamma \max_{a'} Q(a',s') \right]$$
- $\alpha$ is the learning rate; $\gamma$ is the discount factor
- R(s',a,s) is the reward of reaching s' from s by applying a; e.g., -1 if moving, +12 if picking up or dropping off blocks for the PD-World.
Remark: This is the QL approach you must use in Project1!
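The update rule can be sketched in Python as follows. The Q-table is assumed to be a dictionary keyed by (action, state) with unseen entries defaulting to 0, the maximum is assumed to range over the operators applicable in s', and the values of the learning rate and discount factor in the usage lines are purely illustrative since the slides do not fix them.

```python
from collections import defaultdict


def q_update(q, s, a, reward, s_next, next_ops, alpha, gamma):
    """Apply Q(a,s) <- (1-alpha)*Q(a,s) + alpha*[R + gamma*max_a' Q(a',s')].

    q        -- Q-table: dict mapping (action, state) to a value
    next_ops -- operators applicable in s_next (empty if s_next is terminal)
    """
    best_next = max((q[(op, s_next)] for op in next_ops), default=0.0)
    q[(a, s)] = (1 - alpha) * q[(a, s)] + alpha * (reward + gamma * best_next)
    return q[(a, s)]


# Usage with illustrative (assumed) parameters alpha = 0.5 and gamma = 0.5.
q = defaultdict(float)
s, s_next = (2, 1, 0), (1, 1, 0)
first = q_update(q, s, "north", -1, s_next, ["south", "east"], 0.5, 0.5)
q[("pickup", s_next)] = 12.0  # pretend pickup at s_next has been learned
second = q_update(q, s, "north", -1, s_next, ["south", "east", "pickup"], 0.5, 0.5)
```

With these numbers the first update gives -0.5 and the second gives 0.5 × (-0.5) + 0.5 × (-1 + 0.5 × 12) = 2.25, showing how the value of a learned pickup propagates back to the neighboring cell.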

