
1 CHAPTER 16: Reinforcement Learning
Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © The MIT Press (V1.1), 2004

2 Introduction
Game-playing: a sequence of moves to win a game
Robot in a maze: a sequence of actions to find a goal
The agent has a state in an environment, takes an action, sometimes receives a reward, and the state changes
Credit assignment: deciding which of the earlier actions were responsible for the eventual reward
Learn a policy: a mapping from states to actions

3 Single State: K-armed Bandit
Among K levers, choose the one that pays best
Q(a): value of action a; the reward received is r_a
Deterministic rewards: set Q(a) = r_a and choose a* if Q(a*) = max_a Q(a)
Stochastic rewards: keep a running estimate of the expected reward and update it incrementally:
Q_{t+1}(a) ← Q_t(a) + η [ r_{t+1}(a) − Q_t(a) ]
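
A minimal sketch of the stochastic bandit update in Python; the lever payoffs, the Gaussian noise, and the step size eta are illustrative assumptions, not taken from the slides.

```python
import random

def run_bandit(true_means, n_steps=1000, eta=0.1, epsilon=0.1):
    """Estimate Q(a) for a K-armed bandit with the incremental (delta-rule) update."""
    K = len(true_means)
    Q = [0.0] * K
    for _ in range(n_steps):
        # epsilon-greedy: explore occasionally, otherwise pull the best-looking lever
        if random.random() < epsilon:
            a = random.randrange(K)
        else:
            a = max(range(K), key=lambda i: Q[i])
        # stochastic reward: true mean plus Gaussian noise (an assumption for this demo)
        r = random.gauss(true_means[a], 1.0)
        # Q_{t+1}(a) = Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))
        Q[a] += eta * (r - Q[a])
    return Q

# Usage: three levers; the estimates should rank the third lever highest.
print(run_bandit([0.2, 0.5, 1.0]))
```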

4 Example: Q-Learning
Remarks:
The learning rate η determines how quickly Q(s,a) changes based on "new evidence"
The discount factor γ determines the importance of future rewards
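
As a quick illustration of how η and γ enter a single Q-learning backup (the numbers are made up, not from the slides):

```python
# One tabular Q-learning backup: Q(s,a) += eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
# The values below are invented purely for illustration.
eta, gamma = 0.5, 0.9
q_sa = 0.0          # current estimate Q(s, a)
r = 1.0             # reward received after taking a in s
max_q_next = 2.0    # max over a' of Q(s', a') at the observed next state
q_sa += eta * (r + gamma * max_q_next - q_sa)
print(q_sa)  # 0.5 * (1.0 + 0.9 * 2.0 - 0.0) = 1.4; a larger eta moves Q(s,a) faster
```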

5 Example: The STU-World
[Figure: state-transition diagram of the STU-World, with numbered states, per-state rewards, and action labels with transition probabilities; not reproduced in this transcript]
Problem: What actions should an agent choose to maximize its rewards?
Remark: there are no terminal states

6 Elements of RL (Markov Decision Processes)
s_t: state of the agent at time t
a_t: action taken at time t
In s_t, action a_t is taken, the clock ticks, reward r_{t+1} is received, and the state changes to s_{t+1}
Next-state probability: P(s_{t+1} | s_t, a_t)
Reward probability: p(r_{t+1} | s_t, a_t)
Initial state(s), goal state(s)
Episode (trial): a sequence of actions from an initial state to a goal (Sutton and Barto, 1998; Kaelbling et al., 1996)
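
A minimal sketch of how these elements can be held in code; the two-state, two-action MDP below is an invented example, not the one from the lecture.

```python
# A tabular MDP held in plain dictionaries:
#   P[s][a] maps each possible next state to its probability P(s_{t+1} | s_t, a_t)
#   R[s][a] is the expected immediate reward for taking action a in state s
# The two-state, two-action example is invented purely for illustration.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 0.0},
}
gamma = 0.9  # discount factor for the infinite-horizon case
```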

7 Policy and Cumulative Reward
Policy: π : S → A, with a_t = π(s_t)
Value of a policy: V^π(s_t), the expected cumulative reward from starting in s_t and following π
Finite horizon: V^π(s_t) = E[ r_{t+1} + r_{t+2} + ... + r_{t+T} ]
Infinite horizon (discounted): V^π(s_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... ], with discount factor 0 ≤ γ < 1
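
A small sketch that computes both kinds of return for a given reward sequence; the rewards and γ below are arbitrary illustrative values.

```python
def finite_horizon_return(rewards):
    """Sum of the next T rewards, undiscounted."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """Discounted return: r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Usage with an arbitrary reward sequence r_{t+1}, r_{t+2}, ...
rs = [1.0, 0.0, 0.0, 10.0]
print(finite_horizon_return(rs))   # 11.0
print(discounted_return(rs, 0.9))  # 1.0 + 0.9**3 * 10 = 8.29
```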

9 Model-Based Learning
The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known
There is no need for exploration
Can be solved using dynamic programming, e.g. the Bellman update
Solve for the optimal value function and the optimal policy
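
For reference, the Bellman optimality update referred to above can be written in the standard form (this rendering is ours, not copied from the slide):

```latex
V^*(s_t) \;=\; \max_{a_t}\Big(\mathbb{E}\big[\,r_{t+1}\mid s_t,a_t\,\big]
        \;+\; \gamma \sum_{s_{t+1}} P(s_{t+1}\mid s_t,a_t)\, V^*(s_{t+1})\Big)
```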

10 Value Iteration
Goal: find the optimal policy
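
A minimal value-iteration sketch, assuming a small tabular MDP passed in as dictionaries (as in the MDP sketch above) and a convergence tolerance of our choosing; it repeatedly applies the Bellman update until the values stop changing, then reads off a greedy policy.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s][a] -> {next_state: prob}, R[s][a] -> expected reward. Returns (V, policy)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman update: V(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy read off from the converged values
    policy = {
        s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in P
    }
    return V, policy
```

With the P and R dictionaries from the earlier MDP sketch, value_iteration(P, R) returns the converged state values and the greedy policy with respect to them.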

11 Temporal Difference Learning
The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is not known: model-free learning
Exploration is needed to sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t)
Use the reward received in the next time step to update the value of the current state (or action)
The update is driven by the temporal difference between the value of the current action and the reward plus the discounted value of the next state
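
A hedged sketch of one tabular TD(0) backup for state values; the transition used in the usage line is made up for illustration.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, eta=0.1, gamma=0.9):
    """One TD(0) backup: move V(s) toward the one-step target r + gamma * V(s_next)."""
    td_error = r + gamma * V[s_next] - V[s]   # the temporal difference
    V[s] += eta * td_error
    return V

# Usage on a made-up transition: from "s0", reward 1.0 is received and the agent lands in "s1".
V = defaultdict(float)
td0_update(V, "s0", 1.0, "s1")
print(V["s0"])  # 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```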

12 Exploration Strategies
ε-greedy: with probability ε, choose one action uniformly at random; with probability 1−ε, choose the best action
Probabilistic (softmax): choose action a with probability P(a) = exp(Q(a)) / Σ_b exp(Q(b))
Move smoothly from exploration to exploitation: decrease ε over time
Annealing: divide Q(a) by a temperature T in the softmax and lower T over time
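
A minimal sketch of both selection rules; Q is assumed to be a list of action-value estimates, and the ε and temperature values are illustrative.

```python
import math
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon choose uniformly at random, otherwise the best action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax_action(Q, temperature=1.0):
    """Probabilistic selection: P(a) proportional to exp(Q(a)/T); lowering T anneals toward greedy."""
    prefs = [math.exp(q / temperature) for q in Q]
    total = sum(prefs)
    return random.choices(range(len(Q)), weights=[p / total for p in prefs])[0]

# Usage: with these value estimates, both rules pick action 2 most of the time.
Q = [0.1, 0.5, 1.2]
print(epsilon_greedy(Q), softmax_action(Q, temperature=0.5))
```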

13 Nondeterministic Rewards and Actions
When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep running averages (expected values) instead of direct assignments
Q-learning (Watkins and Dayan, 1992): Q(s_t, a_t) ← Q(s_t, a_t) + η ( r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )
Off-policy (Q-learning) vs. on-policy (Sarsa)
Learning V uses the analogous backup (TD-learning: Sutton, 1988)

14 Q-learning
In the update, a' is chosen as the action with the maximum Q value at the next state (off-policy)
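
A hedged sketch of tabular Q-learning; the environment interface (env.reset() and env.step(a) returning next state, reward, and a done flag) is an assumption borrowed from common RL APIs, not from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: off-policy; the backup uses the max over a' at the next state."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap with max_{a'} Q(s', a'), whatever action is taken next
            boot = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += eta * (r + gamma * boot - Q[(s, a)])
            s = s_next
    return Q
```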

15 Sarsa
In the update, a' is the action actually chosen by the policy at the next state (on-policy)
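
The on-policy counterpart, again as a sketch under the same assumed env.reset()/env.step() interface; the only difference from the Q-learning sketch is that the backup uses the a' the policy actually picks.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa: on-policy; the backup uses the a' the policy actually picks next."""
    Q = defaultdict(float)

    def select(s):
        # the same epsilon-greedy policy drives both behaviour and the backup
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            # On-policy target: r + gamma * Q(s', a') with a' drawn from the policy itself
            boot = 0.0 if done else Q[(s_next, a_next)]
            Q[(s, a)] += eta * (r + gamma * boot - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```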

