Reinforcement Learning Ata Kaban School of Computer Science University of Birmingham

Learning by reinforcement

Examples:
– Learning to play Backgammon
– Robot learning to dock on a battery charger

Characteristics:
– No direct training examples – delayed rewards instead
– Need for both exploration and exploitation
– The environment is stochastic and unknown
– The actions of the learner affect future rewards

Brief history & successes

– Minsky's PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer
– Analogies with animal learning and psychology
– TD-Gammon (Tesauro, 1992) – a big success story
– Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997)
– Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach

'An approximate solution to a complex problem can be better than a perfect solution to a simplified problem'

The RL problem

– States s ∈ S
– Actions a ∈ A
– Immediate rewards r_t
– Eventual (discounted cumulative) reward, to be maximised from any starting state
– Discount factor γ for future rewards

Markov Decision Process (MDP)

An MDP is a formal model of the RL problem. At each discrete time point:
– The agent observes state s_t and chooses action a_t
– It receives reward r_t from the environment, and the state changes to s_{t+1}

Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action.

In general, the functions r and δ may not be deterministic and are not necessarily known to the agent.
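To make the formalism concrete, here is a minimal Python sketch of a deterministic MDP in this sense (not from the slides): a hypothetical four-state corridor world in which the agent moves left or right, the rightmost state is an absorbing goal, and the only nonzero reward (100) is received on entering the goal. The names STATES, ACTIONS, delta and r are illustrative assumptions.

STATES = [0, 1, 2, 3]          # the state set S
ACTIONS = ["left", "right"]    # the action set A
GOAL = 3

def delta(s, a):
    # deterministic transition function: s' = delta(s, a)
    if s == GOAL:
        return s                                        # goal state is absorbing
    return max(s - 1, 0) if a == "left" else min(s + 1, GOAL)

def r(s, a):
    # deterministic reward function r(s, a): 100 on entering the goal, else 0
    return 100 if s != GOAL and delta(s, a) == GOAL else 0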

Agent's Learning Task

Execute actions in the environment, observe the results, and learn an action policy π : S → A that maximises the discounted cumulative reward

    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …

from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards.

Note:
– The target function is π : S → A
– There are no training examples of the form (s, a), but only of the form ((s, a), r)
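As a minimal sketch (an illustrative assumption, not part of the slides) of what this quantity is, the function below approximates V^π(s) for a fixed policy in the toy MDP defined above, by rolling the deterministic dynamics forward and truncating the sum at a finite horizon.

def discounted_return(policy, s, gamma=0.9, horizon=50):
    # approximate V^pi(s) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... by truncation
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * r(s, a)
        discount *= gamma
        s = delta(s, a)
    return total

# e.g. the 'always go right' policy reaches the goal in two steps from state 0:
# discounted_return(lambda s: "right", 0) = 0.9**2 * 100 = 81.0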

Example: TD-Gammon

Immediate reward:
– +100 if win
– -100 if lose
– 0 for all other states

Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

Example: Mountain-Car

States: position and velocity
Actions: accelerate forward, accelerate backward, coast
Rewards (two alternative formulations):
– Reward = -1 for every step until the car reaches the top, or
– Reward = 1 at the top and 0 otherwise, with γ < 1

Either way, the eventual reward is maximised by minimising the number of steps to the top of the hill.

Q-learning algorithm (in deterministic worlds)

For each (s, a), initialise the table entry Q̂(s, a) := 0
Observe the current state s
Do forever:
– Select an action a and execute it
– Receive immediate reward r
– Observe the new state s'
– Update the table entry: Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
– s := s'
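The following is a minimal Python sketch of this algorithm (not the slides' own code) for the toy deterministic MDP above, reusing STATES, ACTIONS, delta and r. Purely random action selection and the episode cap are illustrative assumptions; the slide leaves action selection open (see the next slide on exploration).

import random

def q_learning(gamma=0.9, episodes=200):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialise every entry to 0
    for _ in range(episodes):
        s = random.choice(STATES)                        # arbitrary starting state
        for _ in range(20):                              # bounded episode length
            a = random.choice(ACTIONS)                   # explore by picking any action
            reward, s_next = r(s, a), delta(s, a)
            # deterministic-world update: Q(s,a) := r + gamma * max_a' Q(s',a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return Q

# After training, Q[(0, "right")] converges to gamma**2 * 100 = 81.0,
# i.e. the goal reward discounted over the two steps needed to reach it.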

Example: updating Q̂, given the Q̂ values from a previous iteration shown on the arrows.
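As a concrete instance of the update rule (the numbers are illustrative, in the style of Mitchell's grid-world example, and assume γ = 0.9): suppose the agent moves right from state s1 to state s2, the immediate reward is r = 0, and the current estimates for the actions available in s2 are 63, 81 and 100. Then the table entry for that move becomes

    Q̂(s1, a_right) := r + γ max_{a'} Q̂(s2, a') = 0 + 0.9 × max{63, 81, 100} = 90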

Exploration versus Exploitation

The Q-learning algorithm doesn't say how we should choose actions. If we always choose the action that maximises our current estimate of Q, we may end up never exploring better alternatives. To converge to the true Q values we must favour higher estimated Q values, yet still keep a chance of choosing actions with lower estimates, for exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec ]). A probabilistic action selection function of the following form may be employed, where k > 0:

    P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)}

Larger k puts more probability on actions with high estimated Q (exploitation), while k close to 1 makes the choice nearly uniform (exploration).
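A minimal sketch (an illustrative assumption, not from the slides) of this selection rule, applied to the Q table and ACTIONS from the earlier sketches; subtracting the maximum estimate before exponentiating leaves the probabilities unchanged but avoids numerical overflow for large Q values.

import random

def select_action(Q, s, k=2.0):
    # sample an action with probability proportional to k ** Q(s, a)
    max_q = max(Q[(s, a)] for a in ACTIONS)              # shift for numerical stability
    weights = [k ** (Q[(s, a)] - max_q) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]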

Summary

– Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance.
– The goal of a reinforcement learning program is to maximise the eventual reward.
– Q-learning is a form of reinforcement learning that doesn't require that the learner has prior knowledge of how its actions affect the environment.

Further topics: the nondeterministic case

What if the reward and the state transition are not deterministic? E.g. in Backgammon, learning and playing depend on the rolls of the dice! Then V and Q need to be redefined by taking expected values. Similar reasoning applies and a convergent update iteration exists. Will continue next week.
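For reference, a standard formulation of this redefinition (as in Mitchell, Ch. 13, rather than taken from this slide): in the nondeterministic case

    Q(s, a) = E[r(s, a)] + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

and the convergent training rule becomes a decaying-weight average of the old estimate and the deterministic-style update,

    Q̂_n(s, a) := (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ],  with  α_n = 1 / (1 + visits_n(s, a))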