CS 484 – Artificial Intelligence (Slide 1): Announcements
- Homework 5 due Tuesday, October 30
- Book Review due Tuesday, October 30
- Lab 3 due Thursday, November 1

Reinforcement Learning – Lecture 11

CS 484 – Artificial Intelligence (Slide 3): Reinforcement Learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. A reward or penalty indicates the desirability of the resulting state.
Example problems:
- control a mobile robot
- learn to optimize operations in a factory
- learn to play a board game

CS 484 – Artificial Intelligence (Slide 4): RL Diagram
[Figure: the agent–environment loop. The agent sends an Action to the Environment; the Environment returns the resulting State and a Reward.]
Goal: learn to choose actions that maximize cumulative reward. A sketch of this loop follows.
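The diagram translates directly into a control loop. Below is a minimal sketch; the `env` object and its `reset`/`step` interface are illustrative assumptions, not code from the lecture.

```python
def run_episode(env, policy, max_steps=100):
    """Run one agent-environment episode and return the total reward."""
    state = env.reset()                          # initial State from the Environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an Action
        state, reward, done = env.step(action)   # Environment returns State and Reward
        total_reward += reward
        if done:
            break
    return total_reward
```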

CS 484 – Artificial Intelligence (Slide 5): Simple Grid World
Markov Decision Process (MDP):
- The agent perceives a set S of distinct states.
- The agent has a set A of actions that it can perform.
- The environment responds by giving the agent a reward $r_t = r(s_t, a_t)$.
- The environment produces the succeeding state $s_{t+1} = \delta(s_t, a_t)$.
Task of the agent: learn a policy $\pi : S \to A$ with $\pi(s_t) = a_t$.
[Figure: grid world with absorbing goal state G, annotated with the immediate reward r(s, a) values.]
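A minimal sketch of a deterministic grid-world MDP with these ingredients; the 3×3 layout, the goal position, and the reward of 100 for entering the goal are assumptions chosen to match the style of the slides, not their exact figure.

```python
GRID = [(r, c) for r in range(3) for c in range(3)]   # the state set S
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}  # A
GOAL = (0, 2)                                          # absorbing goal state G

def delta(state, action):
    """Deterministic transition s' = delta(s, a); moves into walls stay put."""
    if state == GOAL:
        return state
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if nxt in GRID else state

def reward(state, action):
    """Immediate reward r(s, a): 100 for entering the goal, 0 otherwise."""
    return 100.0 if state != GOAL and delta(state, action) == GOAL else 0.0
```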

CS 484 – Artificial Intelligence (Slide 6): Learning a Policy
We need to learn a policy that maximizes reward over time. Define the cumulative discounted value of following policy $\pi$ from state $s_t$:
$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}$
Learn the optimal policy $\pi^{*}$, which maximizes $V^{\pi}(s_t)$ for all states s.
[Figure: grid world with goal G, annotated with Q(s, a) values – the expected rewards over time when γ = 0.9.]
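The discounted sum above is easy to compute for a finite trajectory; a small sketch (the three-step reward sequence is a made-up example):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_{t+i} over a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Two zero-reward steps, then 100 on entering the goal:
print(discounted_return([0, 0, 100]))  # 0.9**2 * 100 = 81.0
```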

CS 484 – Artificial Intelligence (Slide 7): Using Values to Find the Optimal Policy
[Figures: one optimal policy for the grid world, and the V*(s) values – the highest expected cumulative reward obtainable from each state.]
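Given V*, the optimal policy acts greedily with one step of lookahead: $\pi^{*}(s) = \arg\max_{a} \,[\, r(s, a) + \gamma V^{*}(\delta(s, a)) \,]$. A sketch, reusing the `delta` and `reward` assumptions from the earlier grid-world code:

```python
def greedy_action(V, state, gamma=0.9):
    """pi*(s) = argmax over a of r(s, a) + gamma * V(delta(s, a))."""
    return max(ACTIONS, key=lambda a: reward(state, a) + gamma * V[delta(state, a)])
```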

CS 484 – Artificial Intelligence (Slide 8): Temporal Difference Learning
Learn iteratively by reducing the discrepancy between the estimated values of adjacent states. Initially all values are zero. As the agent moves about the environment, the value of each visited state is updated according to the TD(0) rule:
$V(s_t) \leftarrow V(s_t) + \alpha \,[\, r_t + \gamma V(s_{t+1}) - V(s_t) \,]$
where $\alpha$ is the reinforcement learning constant (the learning rate).
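The update in code; the `env`/`policy` interface matches the earlier loop sketch and is an assumption:

```python
def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate state values with TD(0); unvisited states default to 0."""
    V = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            # Move V(s) toward the one-step lookahead target r + gamma * V(s').
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0))
            s = s_next
    return V
```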

CS 484 – Artificial Intelligence (Slide 9): Calculating the Value of a State
Where do these values come from? Use the Bellman equation.
[Figure: the V*(s) values – the highest expected cumulative reward obtainable from each state.]
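A standard statement of the Bellman equation for a stochastic policy $\pi(s, a)$ with transition probabilities $P(s' \mid s, a)$:

```latex
V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P(s' \mid s, a)
             \left[ r(s, a) + \gamma \, V^{\pi}(s') \right]
```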

CS 484 – Artificial Intelligence (Slide 10): Our GridWorld
The environment is deterministic, so the Bellman equation simplifies to
$V^{\pi}(s) = \sum_{a} \pi(s, a) \,[\, r(s, a) + \gamma V^{\pi}(\delta(s, a)) \,]$
We need a policy $\pi(s, a)$; suppose the agent selects all actions with equal probability.
[Figure: the grid world annotated with the equiprobable policy π(s, a).]
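A sketch of applying this simplified equation as a synchronous sweep over the earlier grid-world assumptions (each sweep reads the old values of s', as the next slides do):

```python
def evaluate_random_policy(sweeps=1, gamma=0.9):
    """Iterative policy evaluation for the equiprobable policy."""
    V = {s: 0.0 for s in GRID}                 # initialize all values to 0
    for _ in range(sweeps):
        V_old = dict(V)                        # use old values of s'
        for s in GRID:
            V[s] = sum((reward(s, a) + gamma * V_old[delta(s, a)]) / len(ACTIONS)
                       for a in ACTIONS)
    return V
```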

CS 484 – Artificial Intelligence (Slide 11): Our GridWorld
Initialize all values to 0. After one application of the Bellman equation:
[Figure: the grid world values after the first sweep.]

CS 484 – Artificial Intelligence (Slide 12): Our GridWorld
Step 2 (use the old value of s'), then step 3:
[Figure: the grid world values after the second and third sweeps.]

CS 484 – Artificial Intelligence (Slide 13): Our GridWorld
Step 4, and so on until the values converge:
[Figure: the grid world values after further sweeps.]
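The step-by-step grids correspond to successive sweeps of the evaluation sketch above; iterating until nothing changes gives the converged values:

```python
def evaluate_to_convergence(gamma=0.9, tol=1e-6):
    """Sweep the simplified Bellman update until the values stop changing."""
    V = {s: 0.0 for s in GRID}
    while True:
        V_old = dict(V)
        for s in GRID:
            V[s] = sum((reward(s, a) + gamma * V_old[delta(s, a)]) / len(ACTIONS)
                       for a in ACTIONS)
        if max(abs(V[s] - V_old[s]) for s in GRID) < tol:
            return V
```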

CS 484 – Artificial Intelligence (Slide 14): Finding the Optimal Policy
Modify the Bellman equation by replacing the expectation over the policy's actions with a maximum, going from
$V^{\pi}(s) = \sum_{a} \pi(s, a) \,[\, r(s, a) + \gamma V^{\pi}(\delta(s, a)) \,]$
to
$V^{*}(s) = \max_{a} \,[\, r(s, a) + \gamma V^{*}(\delta(s, a)) \,]$
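Sweeping with the max-form update is value iteration; a sketch over the same grid-world assumptions:

```python
def value_iteration(gamma=0.9, tol=1e-6):
    """Compute V*(s) with the max-form Bellman update."""
    V = {s: 0.0 for s in GRID}
    while True:
        V_old = dict(V)
        for s in GRID:
            V[s] = max(reward(s, a) + gamma * V_old[delta(s, a)] for a in ACTIONS)
        if max(abs(V[s] - V_old[s]) for s in GRID) < tol:
            return V
```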

CS 484 – Artificial Intelligence (Slide 15): Our GridWorld
Initialize all values to 0. After one application of the max-form Bellman equation:
[Figure: the grid world values after the first sweep.]

CS 484 – Artificial Intelligence (Slide 16): Our GridWorld
Step 2 (use the old value of s'), then step 3:
[Figure: the grid world values after further sweeps.]

CS 484 – Artificial Intelligence (Slide 17): Other GridWorld
- The agent can move in 4 directions from each cell.
- If the agent moves off the grid, reward = −1.
- If the agent is in state A, every action takes it to state A' and it receives a reward of +10.
- If the agent is in state B, every action takes it to state B' and it receives a reward of +5.
[Figure: the grid showing states A, A', B, B' and the state values under a random policy.]
Why is A valued less than 10, and why is B valued more than 5? The sketch below computes the values.
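This grid resembles the classic Sutton and Barto example; the 5×5 layout, the positions of A, A', B, B', and the stay-put rule for off-grid moves below are assumptions in that spirit, not details given on the slide. A sketch that evaluates the random policy:

```python
N = 5
A, A_PRIME = (0, 1), (4, 1)   # assumed positions of A and A'
B, B_PRIME = (0, 3), (2, 3)   # assumed positions of B and B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, move):
    """Return (next_state, reward) under the slide's rules (plus assumptions)."""
    if s == A:
        return A_PRIME, 10.0
    if s == B:
        return B_PRIME, 5.0
    nxt = (s[0] + move[0], s[1] + move[1])
    if 0 <= nxt[0] < N and 0 <= nxt[1] < N:
        return nxt, 0.0
    return s, -1.0            # off the grid: assumed to stay put, reward -1

def evaluate(gamma=0.9, tol=1e-6):
    """Value of the equiprobable random policy at every cell."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        V_old, diff = dict(V), 0.0
        for s in V:
            V[s] = sum(0.25 * (rew + gamma * V_old[s2])
                       for s2, rew in (step(s, m) for m in MOVES))
            diff = max(diff, abs(V[s] - V_old[s]))
        if diff < tol:
            return V

V = evaluate()
print(V[A], V[B])  # A comes out below 10, B above 5
```

Under these assumptions, A's value falls short of 10 because the teleport lands the agent at A' near an edge, where the random policy soon pays −1 penalties; B's value exceeds 5 because B' sits in the interior, where the expected future reward is positive.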