CS 484 – Artificial Intelligence

Announcements
  Homework 5 due Tuesday, October 30
  Book Review due Tuesday, October 30
  Lab 3 due Thursday, November 1
Lecture 11: Reinforcement Learning
Reinforcement Learning
  Addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals
  Uses a reward or penalty to indicate the desirability of the resulting state
  Example problems:
    controlling a mobile robot
    learning to optimize operations in a factory
    learning to play a board game
RL Diagram
  [Figure: the Agent sends an Action to the Environment; the Environment returns a State and a Reward]
  Goal: Learn to choose actions that maximize cumulative reward over time
Simple Grid World
  Markov Decision Process (MDP)
    Agent perceives a set S of distinct states
    Agent has a set A of actions that it can perform
    Environment responds by giving the agent a reward r_t = r(s_t, a_t)
    Environment produces the succeeding state s_{t+1} = δ(s_t, a_t)
  Task of agent: learn a policy π : S → A with π(s_t) = a_t
  [Grid figure: r(s,a) immediate reward values; G marks the goal state]
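The deterministic MDP interface above can be sketched in a few lines of Python. The 2×3 grid layout, the goal cell, and the +100 reward are illustrative assumptions, not the exact figure from the lecture:

```python
# A minimal sketch of the slide's deterministic MDP.
# Grid shape, goal cell, and reward values are assumed for illustration.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS = 2, 3
GOAL = (0, 2)  # assumed goal cell G

def delta(s, a):
    """Deterministic successor function s' = delta(s, a)."""
    if s == GOAL:                      # treat the goal as absorbing
        return s
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if 0 <= r < ROWS and 0 <= c < COLS:
        return (r, c)
    return s                           # bumping a wall leaves the state unchanged

def reward(s, a):
    """Immediate reward r(s, a): +100 for entering the goal, else 0 (assumed)."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

print(delta((0, 1), "right"), reward((0, 1), "right"))  # (0, 2) 100
```

The agent never sees `delta` or `reward` directly; it only observes the successor state and reward after acting, which is what makes this a learning problem.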
Learning a Policy
  Need to learn a policy π
  Maximize reward over time
  Define the cumulative discounted value
    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^∞ γ^i r_{t+i},  where 0 ≤ γ < 1
  Learn the optimal policy π* which maximizes V^π(s_t) for all states s
  [Grid figure: Q(s,a) values – expected rewards over time when γ = 0.9]
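The discounted sum defining V^π can be computed directly for any reward sequence; the sequence below (goal reward +100 on the third step) is made up for illustration:

```python
# Sketch: cumulative discounted value of a reward sequence,
# V = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    """Sum gamma**i * r_i over the observed rewards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# An agent that reaches the goal (+100) on its third step:
print(discounted_return([0, 0, 100]))  # 0 + 0 + 0.81*100 = 81.0
```

Because γ < 1, rewards received sooner count for more, which is why the agent below prefers shorter paths to the goal.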
Using Values to Find the Optimal Policy
  [Grid figure: one optimal policy]
  [Grid figure: V*(s) values – the highest expected reward obtainable from each state]
Temporal Difference Learning
  Learn iteratively by reducing the discrepancy between estimated values for adjacent states
  Initially all values are zero
  As the agent moves about the environment, state values are updated according to
    V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
  where α is the reinforcement learning constant (the learning rate)
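One such update can be sketched as follows; the learning rate α = 0.1 and the sample transition are assumptions for illustration:

```python
# Sketch of the temporal-difference update:
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update to the value table V after observing (s, r, s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = {"A": 0.0, "B": 0.0}            # all values start at zero, per the slide
td_update(V, "A", 10, "B")          # observed: from A, reward 10, landed in B
print(V["A"])                       # 0 + 0.1*(10 + 0.9*0 - 0) = 1.0
```

Unlike the Bellman sweeps on the next slides, this update needs no model of r or δ; it uses only the transitions the agent actually experiences.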
Calculating the Value of a State
  Where do these values come from? Use the Bellman equation:
    V^π(s) = Σ_a π(s,a) Σ_{s'} P(s'|s,a) [ r(s,a) + γ V^π(s') ]
  [Grid figure: V*(s) values – the highest expected reward obtainable from each state]
Our GridWorld
  It is deterministic, so the Bellman equation simplifies to
    V^π(s) = Σ_a π(s,a) [ r(s,a) + γ V^π(δ(s,a)) ]
  Need a policy π(s,a)
  Suppose the agent selects all actions with equal probability: π(s,a) = 1/4
  [Grid figure: π(s,a) = 1/4 for every state–action pair; G marks the goal]
Our GridWorld
  Initialize all values to 0
  [Grid figures: values before and after one application of the Bellman equation; G marks the goal]
Our GridWorld
  Step 2 (use the old value of s')
  [Grid figures: value estimates after successive sweeps; G marks the goal]
Our GridWorld
  Step 4, and so on until the values converge
  [Grid figures: value estimates after further sweeps; G marks the goal]
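The sweeps above can be sketched as iterative policy evaluation: repeatedly apply the simplified Bellman equation under the equiprobable policy, always reading the previous sweep's values for s'. The 2×3 grid and +100 goal reward are assumptions, not the lecture's exact figure:

```python
# Policy evaluation sweeps for the equiprobable policy pi(s,a) = 1/4.
# Grid shape, goal cell, and rewards are assumed for illustration.
GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, m):
    """Deterministic transition: returns (s', r(s, a))."""
    if s == GOAL:
        return s, 0
    r, c = s[0] + m[0], s[1] + m[1]
    s2 = (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s
    return s2, (100 if s2 == GOAL else 0)

V = {(i, j): 0.0 for i in range(ROWS) for j in range(COLS)}
for sweep in range(50):                      # iterate until the values settle
    # Build a new table from the OLD one, matching "use old value of s'".
    V = {s: sum(0.25 * (r + GAMMA * V[s2])   # equal probability 1/4 per action
                for s2, r in (step(s, m) for m in MOVES))
         for s in V}
print(V)
```

Because the dict comprehension builds a whole new table before rebinding V, every sweep reads only the previous sweep's estimates, exactly as the slides prescribe.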
Finding the Optimal Policy
  Modify the Bellman equation from
    V^π(s) = Σ_a π(s,a) [ r(s,a) + γ V^π(δ(s,a)) ]
  to
    V*(s) = max_a [ r(s,a) + γ V*(δ(s,a)) ]
Our GridWorld
  Initialize all values to 0
  [Grid figures: values before and after one application of the max-form Bellman equation; G marks the goal]
Our GridWorld
  Step 2 (use the old value of s')
  [Grid figures: value estimates after successive sweeps; G marks the goal]
Other GridWorld
  [Figure: grid with special cells A, B and their target cells A', B']
  Agent can move in 4 directions from each cell
  If the agent moves off the grid, reward = −1
  If the agent is in state A, all moves take it to state A' and it receives a reward of +10
  If the agent is in state B, all moves take it to state B' and it receives a reward of +5
  [Grid figure: values under a random policy]
  Why is A valued less than 10 and B valued more than 5?
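The slide's closing question can be answered numerically by evaluating the random policy on this grid. The 5×5 size and the cell positions of A, A', B, B' follow the standard version of this example and are assumptions here:

```python
# Random-policy evaluation on the A/B gridworld.
# Grid size and cell coordinates are assumed (standard version of the example).
GAMMA = 0.9
N = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, m):
    if s == A:
        return A_PRIME, 10          # any move from A jumps to A', reward +10
    if s == B:
        return B_PRIME, 5           # any move from B jumps to B', reward +5
    r, c = s[0] + m[0], s[1] + m[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0
    return s, -1                    # moving off the grid: stay put, reward -1

V = {(i, j): 0.0 for i in range(N) for j in range(N)}
for _ in range(200):                # sweep until the values settle
    V = {s: sum(0.25 * (r + GAMMA * V[s2])
                for s2, r in (step(s, m) for m in MOVES))
         for s in V}
print(V[A], V[B])
```

The numbers answer the question: A is worth less than its +10 because the jump lands the agent at the bottom edge, where the random policy keeps incurring −1 penalties before future reward arrives; B is worth more than its +5 because B' sits mid-grid, safely away from the edges, and the agent soon collects further positive reward.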