Università di Milano-Bicocca, Laurea Magistrale in Informatica, Corso di APPRENDIMENTO AUTOMATICO (Machine Learning), Lezione 12 - Reinforcement Learning, Prof. Giancarlo Mauri

Agenda
- Reinforcement Learning Scenario
- Dynamic Programming
- Monte-Carlo Methods
- Temporal Difference Learning

Reinforcement Learning Scenario
[Figure: agent-environment loop. At each step the agent observes state s_t, chooses action a_t, and receives reward r_t; the environment moves to state s_{t+1}, producing the trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...]
Goal: learn to choose actions a_t that maximize future rewards r_0 + γ r_1 + γ^2 r_2 + ..., where 0 < γ < 1 is a discount factor
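As a quick numerical check of this discounted sum, the sketch below evaluates r_0 + γ r_1 + γ^2 r_2 for a short, made-up reward sequence with γ = 0.9 (both the rewards and γ are illustrative, not from the slide):

```python
# Discounted return: r_0 + gamma*r_1 + gamma^2*r_2 + ...
# The reward sequence here is made up purely for illustration.
gamma = 0.9
rewards = [0, 0, 100]          # r_0, r_1, r_2

discounted_return = sum(gamma**i * r for i, r in enumerate(rewards))
print(discounted_return)       # 0 + 0.9*0 + 0.81*100 = 81.0
```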

Reinforcement Learning Scenario
Some application domains:
- Robot learning to dock to a battery station
- Learning to choose actions that optimize factory output
- Learning to play backgammon
Characteristics:
- Delayed reward
- No direct feedback (error signal) for good and bad actions
- Opportunity for active exploration
- Possibility that the state is only partially observable

Example [Tesauro 1995]
- Learning to play backgammon
  - Reward: +100 if win, -100 if lose, 0 for all other states
- Trained by playing 1.5 million games against itself
- Now approximately equals the best human player

Markov Decision Process (MDP)
- Finite set of states S
- Finite set of actions A
- At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t), then receives immediate reward r_t and the state changes to s_{t+1}
- Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t)
  - Reward and next state depend only on the current state s_t and action a_t
  - The functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic
  - The functions δ(s_t, a_t) and r(s_t, a_t) are not necessarily known to the agent
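A minimal sketch of this interaction loop for a deterministic MDP; the two-state environment, its actions, and the δ and r functions below are made up purely for illustration:

```python
# Deterministic MDP: next state and reward depend only on (s_t, a_t).
# The two-state environment below is a made-up illustration.

def delta(s, a):
    """Transition function s_{t+1} = delta(s_t, a_t)."""
    return {("A", "go"): "B", ("A", "stay"): "A",
            ("B", "go"): "A", ("B", "stay"): "B"}[(s, a)]

def reward(s, a):
    """Reward function r_t = r(s_t, a_t)."""
    return 1.0 if (s, a) == ("A", "go") else 0.0

# One short run of the agent-environment loop with a trivial fixed policy.
s = "A"
for t in range(5):
    a = "go"                                   # always choose "go"
    r, s_next = reward(s, a), delta(s, a)
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
```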

Learning Task
- Execute actions in the environment, observe the results, and learn a policy π : S → A that associates to every state s ∈ S an action a ∈ A so as to maximize the expected reward E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...] from any starting state s
- NOTE: 0 < γ < 1 is the discount factor for future rewards
- The target function is π : S → A, but there are no direct training examples of the form ⟨s, a⟩; training examples are of the form ⟨⟨s, a⟩, r⟩

Learning Task
- Consider deterministic environments, i.e. δ(s,a) and r(s,a) are deterministic functions of s and a
- To evaluate a given policy π : S → A the agent might adopt the discounted cumulative reward over time:
  V^π(s) = r_0 + γ r_1 + γ^2 r_2 + ... = Σ_{i=0..∞} γ^i r_i
  where r_0, r_1, ... are generated by following policy π from start state s
- Task: learn the optimal policy π* that maximizes V^π(s):
  π* = argmax_π V^π(s), ∀s

Alternative definitions
- Finite horizon reward: Σ_{i=0..h} r_i
- Average reward: lim_{h→∞} (1/h) Σ_{i=0..h} r_i

State and Action Value Functions
- The state value function denotes the reward for starting in state s and following policy π:
  V^π(s) = r_0 + γ r_1 + γ^2 r_2 + ... = Σ_{i≥0} γ^i r_i
- The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards:
  Q^π(s,a) = r(s,a) + γ r_1 + γ^2 r_2 + ... = r(s,a) + γ V^π(δ(s,a))
- NOTE: this suggests a dynamic programming approach:
  π*(s) = argmax_a (r(s,a) + γ V*(δ(s,a)))

Example
[Figure: a small grid world with states s_1, ..., s_6 and a goal state G; the actions entering G are labelled with reward +100.]
- G is a terminal state: upon entering G the agent obtains a reward of +100, remains in G forever, and obtains no further rewards
- Rewards for all other actions are 0
- γ = 0.9
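The grid layout itself is only visible in the slide's figure, so the sketch below encodes an assumed 2×3 grid with G in the top-right corner; only the reward and terminal-state conventions stated above come from the slide. The later sketches reuse these same tables.

```python
# One possible encoding of a small grid world of this kind. The 2x3 layout
# below (goal G in the top-right corner) is an assumption for illustration;
# the actual layout is the one in the slide's figure.
#
#   s1  s2  G
#   s4  s5  s6
#
# Deterministic transitions delta[(s, a)] = s'; only valid moves are listed.
# G is absorbing: episodes end on entering it, and the entering action gets +100.
GAMMA = 0.9

delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "up"): "s2",    ("s5", "left"): "s4",  ("s5", "right"): "s6",
    ("s6", "up"): "G",     ("s6", "left"): "s5",
}
reward = {sa: (100.0 if s_next == "G" else 0.0) for sa, s_next in delta.items()}

def actions(s):
    """Actions available in state s (empty for the absorbing goal G)."""
    return [a for (st, a) in delta if st == s]
```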

Example
[Figure: the same grid world, with one optimal policy shown as an arrow (action) in each state, all leading towards G.]

Example
[Figure: the same grid world annotated with the V* value of each state.]

Example
[Figure: the same grid world annotated with the Q(s,a) value of each state-action pair.]

What to learn
- We might try to have the agent learn the evaluation function V*
- It could then do a lookahead search to choose the best action from any state s, because
  π*(s) = argmax_a (r(s,a) + γ V*(δ(s,a)))
- But: this works well only if the agent knows the functions δ and r; when it doesn't, it can't choose actions this way

Q function
- Since π*(s) = argmax_a Q(s,a), if the agent learns Q it can choose the optimal action even without knowing δ

Learning Q function
- Note that Q and V* are closely related: V*(s) = max_a Q(s,a)
- This allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
- Let Q̂ denote the learner's current approximation to Q, and consider the training rule
  Q̂(s,a) := r + γ max_{a'} Q̂(s', a')
  where s' is the state resulting from applying action a in state s

Learning Q function
- For each s, a initialize the table entry Q̂(s,a) := 0
- Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe the new state s'
  - Update Q̂(s,a) := r + γ max_{a'} Q̂(s', a')
  - s := s'
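A minimal sketch of this algorithm on the assumed grid world introduced above; the purely random exploration strategy and the episode structure are illustrative choices, not prescribed by the slide:

```python
import random

# Assumed grid world (same illustrative layout as before): deterministic
# transitions delta[(s, a)] = s', reward +100 for entering the absorbing goal G.
GAMMA = 0.9
delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "up"): "s2",    ("s5", "left"): "s4",  ("s5", "right"): "s6",
    ("s6", "up"): "G",     ("s6", "left"): "s5",
}
reward = {sa: (100.0 if s_next == "G" else 0.0) for sa, s_next in delta.items()}

Q = {sa: 0.0 for sa in delta}                  # for each s, a: Q_hat(s, a) := 0

for episode in range(1000):
    s = random.choice(["s1", "s2", "s4", "s5", "s6"])        # random start state
    while s != "G":                            # episode ends at the absorbing goal
        a = random.choice([a for (st, a) in delta if st == s])  # explore randomly
        r, s_next = reward[(s, a)], delta[(s, a)]
        # Q-learning update: Q_hat(s,a) := r + gamma * max_a' Q_hat(s',a')
        best_next = max((Q[(s_next, a2)] for (st, a2) in delta if st == s_next),
                        default=0.0)           # 0 for the terminal state G
        Q[(s, a)] = r + GAMMA * best_next
        s = s_next

print(Q[("s5", "right")], Q[("s5", "up")])     # e.g. compare two learned Q values
```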

Iterative Policy Evaluation
- Bellman equation as an update rule for the action-value function:
  Q^π_{k+1}(s,a) = r(s,a) + γ Σ_{a'} π(δ(s,a), a') Q^π_k(δ(s,a), a')
- γ = 0.9
[Figure: successive iterations of the Q values on the grid world, starting from all zeros.]
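A sketch of this update for the equiprobable random policy π(s,a) = 1/|A(s)|, using the same assumed grid-world tables as before (the layout is an illustrative stand-in for the slide's figure):

```python
# Iterative policy evaluation of Q^pi for the equiprobable random policy
# pi(s, a) = 1 / |A(s)|, on the assumed grid world used above.
GAMMA = 0.9
delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "up"): "s2",    ("s5", "left"): "s4",  ("s5", "right"): "s6",
    ("s6", "up"): "G",     ("s6", "left"): "s5",
}
reward = {sa: (100.0 if s_next == "G" else 0.0) for sa, s_next in delta.items()}

def actions(s):
    return [a for (st, a) in delta if st == s]

Q = {sa: 0.0 for sa in delta}                  # Q_0 = 0 everywhere
for k in range(100):                           # repeated sweeps of the Bellman update
    Q_next = {}
    for (s, a), s_next in delta.items():
        acts = actions(s_next)                 # A(delta(s,a)); empty at the goal G
        expected = sum(Q[(s_next, a2)] for a2 in acts) / len(acts) if acts else 0.0
        Q_next[(s, a)] = reward[(s, a)] + GAMMA * expected
    Q = Q_next

print(round(Q[("s5", "right")], 1))
```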

Policy Improvement
- Suppose we have determined the value function V^π for an arbitrary deterministic policy π
- For some state s we would like to know whether it is better to choose an action a ≠ π(s)
- Selecting a and following the existing policy π afterwards gives us the reward Q^π(s,a)
- If Q^π(s,a) > V^π(s), then a is obviously better than π(s)
- Therefore choose the new policy π' as
  π'(s) = argmax_a Q^π(s,a) = argmax_a [r(s,a) + γ V^π(δ(s,a))]
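A sketch of the improvement step on the same assumed grid world: V^π is first computed for the equiprobable random policy, then π' is obtained by acting greedily with respect to it (the grid layout and the number of evaluation sweeps are illustrative assumptions):

```python
# Greedy policy improvement: pi'(s) = argmax_a [ r(s,a) + gamma * V_pi(delta(s,a)) ].
GAMMA = 0.9
delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "up"): "s2",    ("s5", "left"): "s4",  ("s5", "right"): "s6",
    ("s6", "up"): "G",     ("s6", "left"): "s5",
}
reward = {sa: (100.0 if nxt == "G" else 0.0) for sa, nxt in delta.items()}
states = ["s1", "s2", "s4", "s5", "s6"]

def actions(s):
    return [a for (st, a) in delta if st == s]

# Evaluate the equiprobable random policy (V_pi(G) = 0 since G is absorbing).
V = {s: 0.0 for s in states + ["G"]}
for _ in range(100):
    V = {"G": 0.0,
         **{s: sum(reward[(s, a)] + GAMMA * V[delta[(s, a)]] for a in actions(s))
               / len(actions(s))
            for s in states}}

# Improvement step: act greedily with respect to V_pi.
pi_new = {s: max(actions(s), key=lambda a: reward[(s, a)] + GAMMA * V[delta[(s, a)]])
          for s in states}
print(pi_new)
```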

Example
π'(s) = argmax_a [r(s,a) + γ V^π(δ(s,a))]
[Figure: the grid world with r = 100 for entering G. For the equiprobable random policy π(s,a) = 1/|A(s)| the state values are V^π = 0, 71, 63, 56, 61, 78; after one improvement step the greedy policy π' has values V^π' = 0, 100, 90, 81, 90, 100.]

Example
π'(s) = argmax_a Q^π(s,a)
[Figure: the grid world with the greedy policy with respect to Q^π shown as an arrow in each state.]

Generalized Policy Iteration
- Intertwine policy evaluation with policy improvement:
  π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) π_2 →(E) ... →(I) π* →(E) V^{π*}
- Evaluation (E): V → V^π
- Improvement (I): π → greedy(V)

Value Iteration (Q-Learning)
- Idea: do not wait for policy evaluation to converge, but improve the policy after each iteration:
  V_{k+1}(s) = max_a (r(s,a) + γ V_k(δ(s,a)))
  or
  Q_{k+1}(s,a) = r(s,a) + γ max_{a'} Q_k(δ(s,a), a')
- Stop when ∀s |V_{k+1}(s) - V_k(s)| < ε, or ∀s,a |Q_{k+1}(s,a) - Q_k(s,a)| < ε
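A sketch of value iteration with the ε stopping criterion, again on the assumed grid world (ε = 10^-6 is an arbitrary choice):

```python
# Value iteration: V_{k+1}(s) = max_a [ r(s,a) + gamma * V_k(delta(s,a)) ],
# stopping when the largest change over all states falls below epsilon.
GAMMA, EPSILON = 0.9, 1e-6
delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "up"): "s2",    ("s5", "left"): "s4",  ("s5", "right"): "s6",
    ("s6", "up"): "G",     ("s6", "left"): "s5",
}
reward = {sa: (100.0 if nxt == "G" else 0.0) for sa, nxt in delta.items()}
states = ["s1", "s2", "s4", "s5", "s6"]

def actions(s):
    return [a for (st, a) in delta if st == s]

V = {s: 0.0 for s in states + ["G"]}            # V_0 = 0 everywhere
while True:
    V_next = {"G": 0.0}                          # the absorbing goal keeps value 0
    V_next.update({s: max(reward[(s, a)] + GAMMA * V[delta[(s, a)]]
                          for a in actions(s)) for s in states})
    if max(abs(V_next[s] - V[s]) for s in states) < EPSILON:
        V = V_next
        break
    V = V_next

print({s: round(V[s], 1) for s in states})      # for this layout: 90, 100, 81, 90, 100
```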