Reinforcement Learning

Reinforcement Learning. Michael L. Littman. Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt, which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps

Reinforcement Learning [Read Ch. 13] [Exercise 13.2]. Outline: control learning; control policies that choose optimal actions; Q learning; convergence.

Control Learning. Consider learning to choose actions, e.g., a robot learning to dock on its battery charger, learning to choose actions to optimize factory output, or learning to play Backgammon. Note several problem characteristics: delayed reward; opportunity for active exploration; the possibility that the state is only partially observable; and a possible need to learn multiple tasks with the same sensors/effectors.

One Example: TD-Gammon [Tesauro, 1995]. Learn to play Backgammon. Immediate reward: +100 if win, -100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself; now approximately equal to the best human player.

Reinforcement Learning Problem. Goal: learn to choose actions that maximize the discounted return r_0 + γ r_1 + γ² r_2 + …, where 0 ≤ γ < 1.
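
For example (illustrative numbers, not from the slides): with γ = 0.9 and a reward sequence of 0, 0, 100, the discounted return is 0 + 0.9·0 + 0.9²·100 = 81.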

Markov Decision Processes. Assume a finite set of states S and a set of actions A. At each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A; it then receives immediate reward r_t, and the state changes to s_{t+1}. Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t), i.e., r_t and s_{t+1} depend only on the current state and action. The functions δ and r may be nondeterministic, and they are not necessarily known to the agent.
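
To make the pieces concrete, here is a minimal sketch of such a deterministic MDP in Python, assuming a toy five-state corridor with a goal at one end; the states, actions, and reward values are illustrative assumptions, not taken from the slides. Later sketches below reuse these names.

    # Toy deterministic MDP: a corridor of states 0..4 with a goal at state 4.
    # States, actions, and rewards here are illustrative assumptions.
    STATES = range(5)                 # finite set of states S
    ACTIONS = ["left", "right"]       # finite set of actions A

    def delta(s, a):
        # Transition function: delta(s, a) -> next state (deterministic).
        return min(s + 1, 4) if a == "right" else max(s - 1, 0)

    def r(s, a):
        # Reward function r(s, a): +100 for the move that reaches the goal, else 0.
        return 100 if s != 4 and delta(s, a) == 4 else 0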

Agent’s Learning Task. Execute actions in the environment, observe the results, and learn an action policy π : S → A that maximizes E[r_t + γ r_{t+1} + γ² r_{t+2} + …] from any starting state in S; here 0 ≤ γ < 1 is the discount factor for future rewards (sometimes it makes sense to use γ = 1). Note something new: the target function is π : S → A, but we have no training examples of the form ⟨s, a⟩; training examples are of the form ⟨⟨s, a⟩, r⟩.

Value Function. To begin, consider deterministic worlds. For each possible policy π the agent might adopt, we can define an evaluation function over states, V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0..∞} γ^i r_{t+i}, where r_t, r_{t+1}, … are generated by following policy π starting at state s. Restated, the task is to learn the optimal policy π*, i.e., π* = argmax_π V^π(s) for all s.
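
As a rough illustration, V^π can be approximated in a deterministic world by rolling the policy forward and summing discounted rewards. This sketch reuses the hypothetical delta and r from the corridor example above; the fixed horizon that truncates the infinite sum is also an assumption.

    def value_of_policy(pi, s, gamma=0.9, horizon=50):
        # Approximate V^pi(s) by following pi from s and summing discounted rewards.
        total, discount = 0.0, 1.0
        for _ in range(horizon):      # truncate the infinite sum at a fixed horizon
            a = pi(s)
            total += discount * r(s, a)
            discount *= gamma
            s = delta(s, a)
        return total

    # Example: value of the "always go right" policy from state 0.
    print(value_of_policy(lambda s: "right", s=0))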


What to Learn. We might try to have the agent learn the evaluation function V^{π*} (which we write as V*). It could then do a look-ahead search to choose the best action from any state s, because π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]. A problem: this can work if the agent knows δ : S × A → S and r : S × A → ℝ, but when it doesn't, it can't choose actions this way.
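
A sketch of that look-ahead step, assuming the agent somehow had delta, r, and a table vstar mapping states to V* values (all hypothetical names, reusing the corridor example above):

    def best_action_with_model(s, vstar, gamma=0.9):
        # argmax over a of [ r(s, a) + gamma * V*(delta(s, a)) ]: needs delta and r.
        return max(ACTIONS, key=lambda a: r(s, a) + gamma * vstar[delta(s, a)])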

Q Function. Define a new function very similar to V*: Q(s, a) ≡ r(s, a) + γ V*(δ(s, a)). If the agent learns Q, it can choose the optimal action even without knowing δ, since π*(s) = argmax_a Q(s, a). Q is the evaluation function the agent will learn [Watkins, 1989].
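
By contrast, with a learned Q function the greedy choice needs no model at all. A sketch, assuming Q is stored as a dict keyed by (state, action) pairs:

    def best_action_from_q(s, Q):
        # argmax over a of Q(s, a): no knowledge of delta or r required.
        return max(ACTIONS, key=lambda a: Q[(s, a)])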

Training Rule to Learn Q. Note that Q and V* are closely related: V*(s) = max_{a'} Q(s, a'). This allows us to write Q recursively as Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a'). Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'), where s' is the state resulting from applying action a in state s.
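
For instance (illustrative numbers): with γ = 0.9, immediate reward r = 0, and max_{a'} Q̂(s', a') = 100, the rule sets Q̂(s, a) ← 0 + 0.9·100 = 90.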

Q Learning for Deterministic Worlds. For each s, a, initialize the table entry Q̂(s, a) ← 0. Observe the current state s. Do forever: select an action a and execute it; receive immediate reward r; observe the new state s'; update the table entry Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'); then set s ← s'.
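
A minimal sketch of this loop for the hypothetical corridor MDP above; the random exploration, fixed step budget, and restart at the goal state are assumptions added to make the sketch runnable, not part of the slide's algorithm.

    import random
    from collections import defaultdict

    def q_learning_deterministic(gamma=0.9, steps=10000):
        # Tabular Q learning for a deterministic world, following the slide's loop.
        Q = defaultdict(float)            # table entries Q-hat(s, a), initialized to 0
        s = 0                             # starting state (assumption)
        for _ in range(steps):
            a = random.choice(ACTIONS)    # "select an action a": exploration strategy assumed
            reward = r(s, a)              # receive immediate reward
            s_next = delta(s, a)          # observe the new state s'
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = 0 if s_next == 4 else s_next   # restart after reaching the goal (assumption)
        return Q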

Updating Q. Notice that if rewards are non-negative, then for all s, a, and n we have Q̂_{n+1}(s, a) ≥ Q̂_n(s, a) and 0 ≤ Q̂_n(s, a) ≤ Q(s, a).

Convergence Theorem. Q̂ converges to Q. Consider the case of a deterministic world with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often. Proof: define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the table is reduced by a factor of γ. Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n, that is, Δ_n = max_{s,a} |Q̂_n(s, a) - Q(s, a)|. For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is |Q̂_{n+1}(s, a) - Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a'))| = γ |max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a')| ≤ γ max_{a'} |Q̂_n(s', a') - Q(s', a')| ≤ γ Δ_n.

Note we used the general fact that |max_a f1(a) - max_a f2(a)| ≤ max_a |f1(a) - f2(a)|. This works with operators other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].

Non-deterministic Case (1). What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values: V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …] = E[Σ_{i=0..∞} γ^i r_{t+i}] and Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))].

Nondeterministic Case (2). Q learning generalizes to nondeterministic worlds. Alter the training rule to Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_{a'} Q̂_{n-1}(s', a')], where α_n = 1 / (1 + visits_n(s, a)). One can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. Standard properties required of the learning rates: Σ_n α_n = ∞ and Σ_n α_n² < ∞.
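
A sketch of one such update with the decaying learning rate α_n = 1 / (1 + visits_n(s, a)); Q and visits are assumed to be defaultdicts keyed by (state, action), and ACTIONS comes from the corridor example above.

    def q_update_nondeterministic(Q, visits, s, a, reward, s_next, gamma=0.9):
        # One nondeterministic-world Q update with decaying alpha_n = 1 / (1 + visits(s, a)).
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target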

Temporal Difference Learning (1). Q learning: reduce the discrepancy between successive Q estimates. One-step time difference: Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a). Why not two steps, Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a), or n, Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)? Blend all of these: Q^λ(s_t, a_t) ≡ (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …].

Temporal Difference Learning (2). Equivalent expression: Q^λ(s_t, a_t) = r_t + γ [(1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]. The TD(λ) algorithm uses the above training rule. It sometimes converges faster than Q learning, and it converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]. Tesauro's TD-Gammon uses this algorithm.
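
A rough sketch of computing the blended target Q^λ from a recorded trajectory, truncating the infinite blend at the trajectory length; the trajectory format (parallel lists of rewards and successor states) and the truncation are assumptions, and ACTIONS again comes from the corridor example above.

    def lambda_return(rewards, next_states, Q, lam=0.7, gamma=0.9):
        # Blend the n-step estimates Q^(n) with weights (1 - lam) * lam**(n - 1).
        # rewards[i] is r_{t+i}; next_states[i] is s_{t+i+1}, the state reached after r_{t+i}.
        blended, discounted_rewards = 0.0, 0.0
        for n in range(1, len(rewards) + 1):
            discounted_rewards += gamma ** (n - 1) * rewards[n - 1]
            bootstrap = gamma ** n * max(Q[(next_states[n - 1], a)] for a in ACTIONS)
            blended += (1 - lam) * lam ** (n - 1) * (discounted_rewards + bootstrap)
        return blended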

Subtleties and Ongoing Research. Replace the table with a neural net or other generalizer. Handle the case where the state is only partially observable (next?). Design optimal exploration strategies. Extend to continuous actions and states. Learn and use an approximate model δ̂ : S × A → S. Relationship to dynamic programming (next?).