1 Announcements
Homework 5 due Tuesday, October 30
Book Review due Tuesday, October 30
Lab 3 due Thursday, November 1

2 Reinforcement Learning (CS 484 – Artificial Intelligence, Lecture 11)

3 Reinforcement Learning
Addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals
Use a reward or penalty to indicate the desirability of the resulting state
Example problems:
control a mobile robot
learn to optimize operations in a factory
learn to play a board game

4 RL Diagram
[Figure: the agent-environment loop – the environment sends the agent the current state and a reward, and the agent sends back an action.]
Process: at each step the agent observes the state s_t, chooses an action a_t, and receives a reward r_t
Goal: learn to choose actions that maximize the discounted cumulative reward r_0 + γ·r_1 + γ²·r_2 + …, with 0 ≤ γ < 1
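
The loop in the diagram is short in code. Below is a minimal Python sketch; the one-dimensional toy world and the names env_step and choose_action are stand-ins invented for illustration, not anything from the slides.

import random

def env_step(state, action):
    """Toy environment: walk on positions 0..3; reaching position 3 pays reward 1."""
    next_state = max(0, min(3, state + action))
    reward = 1.0 if next_state == 3 else 0.0
    return reward, next_state

def choose_action(state):
    """Placeholder policy: step left (-1) or right (+1) at random."""
    return random.choice([-1, +1])

state, ret, gamma = 0, 0.0, 0.9
for t in range(20):                          # one short episode
    action = choose_action(state)            # agent acts on the environment
    reward, state = env_step(state, action)  # environment returns reward and next state
    ret += (gamma ** t) * reward             # accumulate the discounted reward
print(ret)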

5 Simple Grid World
Markov Decision Process (MDP):
Agent perceives a set S of distinct states
Agent has a set A of actions that it can perform
Environment responds by giving the agent a reward r_t = r(s_t, a_t)
Environment produces the succeeding state s_{t+1} = δ(s_t, a_t)
Task of the agent: learn a policy π : S → A with π(s_t) = a_t
[Grid figure: r(s, a) immediate-reward values – 0 for every action except those entering the goal state G, which are worth 100.]
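
Here is a sketch of what S, A, r, and δ for this grid world might look like in code, assuming the 2×3 layout with the goal G in the top-right corner that the figure suggests; the helper names are mine.

GOAL = (0, 2)                    # G in the top-right corner (assumed layout)
STATES = [(r, c) for r in range(2) for c in range(3)]                        # the set S
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}  # the set A

def legal_actions(s):
    """Actions that keep the agent on the grid (2 or 3 per state, as in the figure)."""
    return [a for a, (dr, dc) in ACTIONS.items() if (s[0] + dr, s[1] + dc) in STATES]

def delta(s, a):
    """Deterministic successor function s' = delta(s, a)."""
    dr, dc = ACTIONS[a]
    return (s[0] + dr, s[1] + dc)

def reward(s, a):
    """Immediate reward r(s, a): 100 for any action that enters G, otherwise 0."""
    return 100.0 if delta(s, a) == GOAL else 0.0

print(reward((0, 1), "right"))   # 100.0 – the action that moves into G

(G is treated as a terminal state in the slides; an episode ends once the agent enters it.)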

6 Learning a Policy
Need to learn a policy π that maximizes reward over time
Define the cumulative (discounted) value V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i·r_{t+i}
Learn the optimal policy, the one which maximizes V^π(s_t) for all states s
[Grid figure: Q(s, a) values – expected discounted rewards over time when γ = 0.9; the values shown (100, 90, 81, 72) decrease with each extra step needed to reach G.]
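
To make the cumulative value concrete, the short Python sketch below (not from the slides) simply sums a discounted reward sequence with γ = 0.9.

def discounted_return(rewards, gamma=0.9):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A path that reaches G on its third move earns rewards 0, 0, 100,
# so it is worth 0.9**2 * 100 = 81 – one of the values in the Q(s, a) figure.
print(round(discounted_return([0, 0, 100]), 3))   # 81.0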

7 Using Values to Find the Optimal Policy
One optimal policy: in each state, take the action that maximizes the immediate reward plus the discounted value of the successor, i.e. π*(s) = argmax_a [ r(s, a) + γ·V*(δ(s, a)) ]
[Grid figure: V*(s) values – the highest expected discounted reward obtainable from each state: 90 and 100 in the top row (G itself is 0) and 81, 90, 100 in the bottom row.]
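
The sketch below applies that greedy rule to the V*(s) grid; the coordinates, the γ = 0.9 from the earlier slide, and the helper names are my assumptions for illustration.

GAMMA = 0.9
V_STAR = {(0, 0): 90, (0, 1): 100, (0, 2): 0,     # top row; (0, 2) is G
          (1, 0): 81, (1, 1): 90, (1, 2): 100}    # bottom row
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def greedy_action(state):
    """pi*(s) = argmax over actions a of [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    best = None
    for name, (dr, dc) in MOVES.items():
        nxt = (state[0] + dr, state[1] + dc)
        if nxt not in V_STAR:                      # this move would leave the grid
            continue
        value = (100 if nxt == (0, 2) else 0) + GAMMA * V_STAR[nxt]
        if best is None or value > best[0]:
            best = (value, name)
    return best[1]

print(greedy_action((1, 0)))   # 'up' – it ties with 'right'; both are optimal from that cell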

8 Temporal Difference Learning
Learn iteratively by reducing the discrepancy between the estimated values of adjacent states
Initially all values are zero
As the agent moves about the environment, the values of visited states are updated according to the following rule, where α is the reinforcement learning constant (learning rate):
V(s_t) ← V(s_t) + α·[ r_t + γ·V(s_{t+1}) − V(s_t) ]
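
A minimal Python sketch of one such update (the standard TD(0) rule; the dictionary representation and the example transition are mine, not from the slides):

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V[s] part of the way toward r + gamma * V[s_next]."""
    V.setdefault(s, 0.0)
    V.setdefault(s_next, 0.0)
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

# All values start at zero; observing a transition that enters the goal (reward 100)
# pulls the visited state's value up from 0 toward 100.
V = {}
td_update(V, "s5", 100.0, "goal")
print(V["s5"])   # 10.0 with alpha = 0.1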

9 Calculating the Value of a State
Where do these values come from? Use the Bellman equation, which for a general MDP reads
V^π(s) = Σ_a π(s, a) · Σ_{s'} P(s' | s, a) · [ r(s, a) + γ·V^π(s') ]
[Grid figure: V*(s) values – the highest expected discounted reward from each state, as on slide 7.]

10 Our GridWorld
It is deterministic, so the Bellman equation can be simplified to
V^π(s) = Σ_a π(s, a) · [ r(s, a) + γ·V^π(δ(s, a)) ]
Need a policy π(s, a): suppose the agent selects all of its legal actions with equal probability
[Grid figure: π(s, a) values – 0.5 per action in the corner states (two legal moves) and 0.33 per action in the edge states (three legal moves), so the probabilities in each state sum to 1.]

11 Our GridWorld
Initialize all values to 0
After one application of the Bellman equation:
[Grids: all six values start at 0; after one sweep the grid reads 0, 33.333, G in the top row and 0, 0, 50.0 in the bottom row.]

12 Our GridWorld
Step 2 (use the old values of s'):
[Grid: top row 15.0, 33.333, G; bottom row 0, 25.0, 50.0]
Step 3:
[Grid: top row 15, 45.333, G; bottom row 18, 25, 61.25]

13 Our GridWorld
Step 4:
[Grid: top row 28.5, 45.333, G; bottom row 18, 37.375, 61.25]
…
Step 58:
[Grid: top row 51.822, 66.064, G; bottom row 49.097, 57.281, 75.777]
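
These sweeps are easy to reproduce. The following Python sketch evaluates the equal-probability policy on the assumed 2×3 layout (goal G top-right, reward 100 for entering G, γ = 0.9), always using the old values of s' as on slide 12; the first sweep yields the 33.333 and 50.0 above, and 58 sweeps land close to the step-58 grid.

GOAL, GAMMA = (0, 2), 0.9
STATES = [(r, c) for r in range(2) for c in range(3)]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def legal_successors(s):
    """Neighbouring cells reachable without leaving the grid."""
    return [(s[0] + dr, s[1] + dc) for dr, dc in MOVES if (s[0] + dr, s[1] + dc) in STATES]

V = {s: 0.0 for s in STATES}
for sweep in range(58):
    old = dict(V)                                 # "use the old value of s'"
    for s in STATES:
        if s == GOAL:                             # G is terminal; its value stays 0
            continue
        succs = legal_successors(s)
        p = 1.0 / len(succs)                      # equal probability over legal actions
        V[s] = sum(p * ((100.0 if s2 == GOAL else 0.0) + GAMMA * old[s2]) for s2 in succs)
    if sweep == 0:
        print("after 1 sweep  :", {s: round(v, 3) for s, v in V.items()})
print("after 58 sweeps:", {s: round(v, 3) for s, v in V.items()})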

14 Finding the Optimal Policy
Modify the Bellman equation from
V^π(s) = Σ_a π(s, a) · [ r(s, a) + γ·V^π(δ(s, a)) ]
to
V*(s) = max_a [ r(s, a) + γ·V*(δ(s, a)) ]

15 Our GridWorld
Initialize all values to 0
After one application of the (max) Bellman equation:
[Grids: all values start at 0; after one sweep a state whose best action leads directly into G is worth 100, and the other values shown are still 0.]

16 Our GridWorld
Step 2 (use the old values of s'):
[Grid: top row 90, 100, G; bottom row 0, 90, 100]
Step 3:
[Grid: top row 90, 100, G; bottom row 81, 90, 100 – the V*(s) values from slide 7]
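
The same sweep with a max in place of the expectation (a sketch under the same layout assumptions as before) reaches those V*(s) values after only a few iterations; depending on the update order, intermediate sweeps can differ slightly from the slides, but the fixed point is the same.

GOAL, GAMMA = (0, 2), 0.9
STATES = [(r, c) for r in range(2) for c in range(3)]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def legal_successors(s):
    return [(s[0] + dr, s[1] + dc) for dr, dc in MOVES if (s[0] + dr, s[1] + dc) in STATES]

V = {s: 0.0 for s in STATES}
for sweep in range(4):
    old = dict(V)                                 # "use the old value of s'"
    for s in STATES:
        if s == GOAL:                             # terminal state keeps value 0
            continue
        V[s] = max((100.0 if s2 == GOAL else 0.0) + GAMMA * old[s2] for s2 in legal_successors(s))
    print("after sweep", sweep + 1, ":", {s: round(v) for s, v in V.items()})
# By sweep 3 the grid shows 90, 100 in the top row and 81, 90, 100 in the bottom row.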

17 Other GridWorld
[Figure: 5×5 grid with two special cells A and B in the top row and their target cells A' and B' lower in the grid.]
The agent can move in 4 directions from each cell
If the agent moves off the grid, reward = −1
If the agent is in state A, all moves take it to state A' and it receives a reward of +10
If the agent is in state B, all moves take it to state B' and it receives a reward of +5

Values following a random policy:
 1.84   9.54   3.60   5.50   0.81
 0.85   3.00   2.09   1.91   0.33
 0.05   0.87   0.80   0.55  -0.16
-0.40   0.00   0.05  -0.09  -0.47
-0.71  -0.52  -0.48  -0.54  -0.73

Why is A valued less than 10, and B valued more than 5?
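
Below is a Python sketch of this grid's dynamics. The coordinates of A, B, A', and B', and the convention that an off-grid move leaves the agent where it is, are assumptions read off the figure, not statements from the slide.

A, A_PRIME = (0, 1), (4, 1)       # assumed positions of A and its target A'
B, B_PRIME = (0, 3), (2, 3)       # assumed positions of B and its target B'
SIZE = 5

def step(state, move):
    """Return (reward, next_state) for one action; move is a (dr, dc) offset."""
    if state == A:                                  # every action from A jumps to A', reward +10
        return 10.0, A_PRIME
    if state == B:                                  # every action from B jumps to B', reward +5
        return 5.0, B_PRIME
    nr, nc = state[0] + move[0], state[1] + move[1]
    if 0 <= nr < SIZE and 0 <= nc < SIZE:
        return 0.0, (nr, nc)                        # ordinary move, reward 0
    return -1.0, state                              # off the grid: reward -1, assume agent stays put

print(step(A, (0, 1)))   # (10.0, (4, 1))

Iterating the Bellman equation over these dynamics under the equal-probability random policy yields a value table like the one above. A is worth less than 10 because the expected return from A', near the bottom edge where running off the grid costs −1, is negative; B is worth more than 5 because the expected return from B', in the interior of the grid, is positive.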

