CHAPTER 16: Reinforcement Learning

Introduction
- Game-playing: a sequence of moves to win a game
- Robot in a maze: a sequence of actions to find a goal
- The agent has a state in an environment; it takes an action, sometimes receives a reward, and the state changes
- Credit assignment: which of the earlier actions deserve credit for a reward that arrives only later?
- Goal: learn a policy that maps states to actions
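As a point of reference for what follows, here is a minimal sketch of the agent-environment loop just described. The `env` and `policy` objects and their `reset`/`step` methods are illustrative assumptions, not part of the slides.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# Assumed interface: env.reset() -> initial state;
# env.step(a) -> (next_state, reward, done).

def run_episode(env, policy):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # agent acts in its current state
        state, reward, done = env.step(action)   # environment transitions, pays reward
        total_reward += reward
    return total_reward
```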

Single State: K-armed Bandit
- Among K levers, choose the one that pays best
- Q(a): value of action a; its reward is r_a
- Deterministic rewards: set Q(a) = r_a and choose a* such that Q(a*) = max_a Q(a)
- Stochastic rewards: keep a running estimate of the expected reward instead,
  $Q_{t+1}(a) \leftarrow Q_t(a) + \eta\,[\,r_{t+1}(a) - Q_t(a)\,]$
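A minimal sketch of this bandit with the running-average update; the ε-greedy action choice and all numeric values are illustrative assumptions, not prescribed by the slide.

```python
import random

K = 10           # number of levers
eta = 0.1        # learning rate (illustrative value)
epsilon = 0.1    # exploration probability (illustrative value)
Q = [0.0] * K    # estimated value of each lever

def pull(a):
    # Hypothetical stochastic payoff: lever a pays a on average, plus noise.
    return a + random.gauss(0.0, 1.0)

for t in range(10_000):
    if random.random() < epsilon:
        a = random.randrange(K)                 # explore a random lever
    else:
        a = max(range(K), key=lambda i: Q[i])   # exploit the best estimate
    r = pull(a)
    Q[a] += eta * (r - Q[a])                    # running estimate of E[r_a]
```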

Example: Q-Learning
Remarks:
- The learning rate η determines how quickly Q(s, a) changes in response to new evidence
- The discount factor γ determines the importance of future rewards
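To make the roles of η and γ concrete, here is a worked instance of the Q-learning update with illustrative numbers (the update rule itself is restated on the nondeterministic-rewards slide below). Assume $Q(s,a) = 0$, $r = 1$, $\max_{a'} Q(s',a') = 2$, $\eta = 0.5$, $\gamma = 0.9$:

$$Q(s,a) \leftarrow Q(s,a) + \eta\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,] = 0 + 0.5\,[\,1 + 0.9 \cdot 2 - 0\,] = 1.4$$

A larger η moves Q(s, a) further toward the new target in one step; a larger γ weights the next state's value more heavily inside that target.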

Example: The STU-World
[Figure: a graph of ten numbered states, each carrying a reward (values such as R = +5, +1, +9, -5, -9 appear), connected by stochastic transitions labeled with actions (n, s, e, w, ne, nw, sw) and probabilities (e.g. x/0.9, x/0.1, y/0.4, y/0.5).]
Problem: What actions should an agent choose to maximize its rewards?
Remark: there are no terminal states.

Elements of RL (Markov Decision Processes)
- s_t: state of the agent at time t
- a_t: action taken at time t
- In state s_t, action a_t is taken; the clock ticks, reward r_{t+1} is received, and the state changes to s_{t+1}
- Next-state probability: P(s_{t+1} | s_t, a_t)
- Reward probability: p(r_{t+1} | s_t, a_t)
- Initial state(s), goal state(s)
- Episode (trial): a sequence of actions from an initial state to the goal
(Sutton and Barto, 1998; Kaelbling et al., 1996)
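A minimal sketch of how such an MDP might be represented in code, with sampling from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t); the concrete states, actions, and numbers are made up for illustration.

```python
import random

# Tabular MDP: transitions[s][a] is a list of (prob, next_state, reward)
# triples. The entries below are illustrative, not from the slides.
transitions = {
    "s0": {"go": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"go": [(1.0, "s0", -1.0)]},
}

def step(s, a):
    # Sample (s_{t+1}, r_{t+1}) according to the transition model.
    probs, states, rewards = zip(*transitions[s][a])
    i = random.choices(range(len(probs)), weights=probs)[0]
    return states[i], rewards[i]
```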

Policy and Cumulative Reward
- Policy: $\pi : S \rightarrow A$, with $a_t = \pi(s_t)$
- Value of a policy: the expected cumulative reward obtained by starting in s_t and following π
- Finite horizon: $V^\pi(s_t) = E[\,r_{t+1} + r_{t+2} + \cdots + r_{t+T}\,]$
- Infinite horizon, discounted: $V^\pi(s_t) = E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$, where $0 \le \gamma < 1$ is the discount factor
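A tiny sketch of computing a discounted return from one sampled reward sequence; the rewards and the value of γ are illustrative.

```python
gamma = 0.9                      # discount factor (illustrative value)
rewards = [1.0, 0.0, -1.0, 2.0]  # r_{t+1}, r_{t+2}, ... from one episode

# Discounted return: sum over i of gamma^(i-1) * r_{t+i}
ret = sum(gamma**i * r for i, r in enumerate(rewards))
print(ret)  # 1.0 + 0.0 - 0.81 + 1.458 = 1.648
```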

Bellman's Equation
The optimal value function satisfies
$V^*(s_t) = \max_{a_t}\left( E[\,r_{t+1} \mid s_t, a_t\,] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$
and the optimal policy chooses, in each state, an action that attains this maximum.

Model-Based Learning
- The environment, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known
- There is no need for exploration
- Can be solved using dynamic programming, e.g. the Bellman update
- Solve for the optimal value function and the optimal policy

Value Iteration
Goal: find the optimal policy by repeatedly applying the Bellman update to all states until the value function converges (a sketch follows below).
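A minimal sketch of value iteration over the tabular MDP representation introduced earlier; the discount factor and stopping threshold are illustrative choices.

```python
def value_iteration(transitions, gamma=0.9, theta=1e-6):
    """Tabular value iteration; returns V and a greedy policy.

    transitions[s][a] is a list of (prob, next_state, reward) triples,
    as in the earlier sketch; gamma and theta are illustrative values.
    """
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            # Bellman update: one-step lookahead over every action.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = {
        s: max(transitions[s], key=lambda a: sum(
            p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a]))
        for s in transitions
    }
    return V, policy
```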

Temporal Difference Learning
- The environment, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is not known: model-free learning
- Exploration is needed to sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t)
- Use the reward received in the next time step to update the value of the current state (or state-action pair)
- The update is driven by the temporal difference between the value of the current state (action) and the reward plus the discounted value of the next state
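A minimal sketch of the TD(0) update for state values under a fixed policy; the env/policy interface is the same illustrative assumption as before, and the parameter values are arbitrary.

```python
from collections import defaultdict

def td0(env, policy, episodes=1000, eta=0.1, gamma=0.9):
    # V(s) <- V(s) + eta * (r + gamma * V(s') - V(s))
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()              # hypothetical env interface, as before
        done = False
        while not done:
            s2, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += eta * (target - V[s])   # temporal-difference update
            s = s2
    return V
```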

Exploration Strategies
- ε-greedy: with probability ε, choose one action uniformly at random; with probability 1 - ε, choose the best action
- Probabilistic (softmax): choose action a with probability $P(a) = \dfrac{\exp Q(a)}{\sum_b \exp Q(b)}$
- Move smoothly from exploration to exploitation: decrease ε over time, or anneal a temperature T in $P(a) = \dfrac{\exp(Q(a)/T)}{\sum_b \exp(Q(b)/T)}$, starting with a large T and decreasing it
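A sketch of both strategies as action-selection functions; parameter values are illustrative.

```python
import math
import random

def epsilon_greedy(Q, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # explore uniformly
    return max(actions, key=lambda a: Q[a])      # exploit the best estimate

def softmax_action(Q, actions, T=1.0):
    # Annealing: start with a large T (near-uniform choice) and decrease it
    # so the distribution concentrates on the highest-valued actions.
    weights = [math.exp(Q[a] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]
```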

Nondeterministic Rewards and Actions
- When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep running averages (expected values) instead of making direct assignments
- Q-learning (Watkins and Dayan, 1992):
  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,[\,r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\,]$
- Off-policy (Q-learning) vs. on-policy (Sarsa)
- Learning V (TD-learning; Sutton, 1988):
  $V(s_t) \leftarrow V(s_t) + \eta\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$
- Each update backs up the value of the next state (action) to the current one

Q-learning
In the Q-learning update target, a' is chosen as the action with the maximum Q value in the next state, regardless of which action the behavior policy actually takes next (off-policy).
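A minimal tabular Q-learning sketch, reusing the hypothetical env interface from the earlier sketches; all parameter values are illustrative.

```python
from collections import defaultdict
import random

def q_learning(env, actions, episodes=5000, eta=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)   # Q[(s, a)] -> estimated action value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda x: Q[(s, x)])     # exploit
            s2, r, done = env.step(a)
            # Off-policy target: the best action in s2, whatever we do next.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```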

Sarsa
Sarsa uses the same update, but a' is the action actually chosen by the current (e.g. ε-greedy) policy in the next state, so the target follows the behavior policy (on-policy).
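For contrast, a Sarsa sketch under the same illustrative assumptions; note that the only substantive difference from the Q-learning sketch is where the target action comes from.

```python
from collections import defaultdict
import random

def sarsa(env, actions, episodes=5000, eta=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)

    def choose(s):  # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = None if done else choose(s2)
            # On-policy target: a2 is the action the policy will really take.
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += eta * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```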