Machine Learning: Symbol-based 9d

Presentation transcript:

1 Machine Learning: Symbol-based 9d 9.0 Introduction 9.1 A Framework for Symbol-based Learning 9.2 Version Space Search 9.3 The ID3 Decision Tree Induction Algorithm 9.4 Inductive Bias and Learnability 9.5 Knowledge and Learning 9.6 Unsupervised Learning 9.7 Reinforcement Learning 9.8 Epilogue and References 9.9 Exercises. Additional reference for the slides: Thomas Dean, James Allen, and Yiannis Aloimonos, Artificial Intelligence: Theory and Practice, Addison Wesley, 1995, Section 5.9.

2 Reinforcement Learning A form of learning in which the agent can explore and learn through interaction with the environment. The agent learns a policy, which is a mapping from states to actions; the policy tells what the best move is in a particular state. It is a general methodology: planning, decision making, and search can all be viewed as some form of reinforcement learning.

3 Tic-tac-toe: a different approach Recall the minimax approach: the agent knows its current state, generates a two-layer search tree taking into account all the possible moves for itself and the opponent, backs up values from the leaf nodes, and takes the best move assuming that the opponent will also do so. An alternative is to start playing directly against an opponent (the opponent does not have to be perfect, but it could be). Assume no prior knowledge or lookahead. Assign “values” to states: 1 for a win, 0 for a loss or a draw, 0.5 for anything else.

4 Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves, so the learner has no guidance initially. It engages in playing. When a game ends, if it is a win, the value 1 is propagated backwards; if it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their “true” value. After several plays, the learner will have learned the best move for a given state (a policy).
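To make the idea concrete, here is a minimal sketch (my own illustration, not code from the slides) of a value table updated after each game. The names (values, ALPHA, update_after_game) and the specific update rule, a small step toward the final outcome, are assumptions; the slides only say that the final value is propagated backwards.

```python
# A minimal sketch of the value-learning idea above (my own illustration, not code from
# the slides): keep a table of state values, default unseen states to 0.5, and after
# each game nudge every visited state's value toward the final outcome. The step size
# ALPHA and the exact update rule are assumptions; the slides only say the final value
# is "propagated backwards".

values = {}     # state -> estimated value
ALPHA = 0.1     # assumed step size toward the observed outcome

def value(state):
    """Current estimate for a state, defaulting to the uninformative 0.5."""
    return values.get(state, 0.5)

def update_after_game(visited_states, outcome):
    """Propagate the final outcome (1 for a win, 0 for a loss or draw) backwards."""
    for state in reversed(visited_states):
        values[state] = value(state) + ALPHA * (outcome - value(state))

# Example with abstract state labels: a game that visited s0, s1, s2 and was won.
update_after_game(["s0", "s1", "s2"], outcome=1.0)
print(value("s2"))   # ~0.55: moved a little toward 1.0
```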

5 Issues in generalizing this approach How will the state values be initialized or propagated backwards? What if there is no end to the game (infinite horizon)? This is an optimization problem, which suggests that it is hard. How can an optimal policy be learned?

6 A simple robot domain The robot is in one of four states: 0, 1, 2, 3. Each one represents an office, and the offices are connected in a ring. Three actions are available: + moves to the “next” state, - moves to the “previous” state, and a third action remains at the same state.

7 The robot domain (cont’d) The robot can observe the label of the state it is in and perform any action corresponding to an arc leading out of its current state. We assume that there is a clock governing the passage of time, and that at each tick of the clock the robot has to perform an action. The environment is deterministic: there is a unique state resulting from any initial state and action. Each state has a reward: 10 for state 3, 0 for the others.

8 The reinforcement learning problem Given information about the environment (states, actions, and the state-transition function or diagram), output a policy π: states → actions, i.e., find the best action to execute at each state. Assumes that the state is completely observable (the agent always knows which state it is in).
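As a concrete illustration, here is a small sketch of this domain and of a policy as a state-to-action mapping. The encoding (STATES, REWARD, step, and the label "stay" for the third action) is my own, chosen to match the description above.

```python
# A sketch of the four-office ring world as data (names such as STATES, REWARD, step,
# and the label "stay" for the third action are mine, chosen to match the description).

STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "stay"]
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}      # reward 10 in office 3, 0 elsewhere

def step(state, action):
    """Deterministic state-transition function f(state, action) on the ring."""
    if action == "+":
        return (state + 1) % 4
    if action == "-":
        return (state - 1) % 4
    return state                         # "stay"

# A policy is just a mapping from states to actions, e.g. "policy 1" from the slides below:
policy1 = {0: "+", 1: "+", 2: "+", 3: "stay"}
print(step(2, "+"), REWARD[step(2, "+")])    # 3 10
```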

9 Compare three policies a. Every state is mapped to the action that stays put. The value of this policy is 0 (starting from any office other than 3), because the robot will never get to office 3. b. Every state is mapped to + (call this policy 0). The value of this policy is unbounded, because the robot will end up in office 3 infinitely often. c. Every state except 3 is mapped to +, and 3 is mapped to the stay action (call this policy 1). The value of this policy is also unbounded, because the robot will end up in (and stay in) office 3 infinitely often.

10 Compare three policies So, it is easy to rule case a out, but how can we show that policy 1 is better than policy 0? One way would be to compute the average reward per tick: POLICY 1: the average reward per tick for state 0 approaches 10. POLICY 0: the average reward per tick for state 0 is 10/4. Another way would be to assign higher values to immediate rewards and apply a discount to future rewards.
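A quick simulation (a hedged sketch; the names and the tick count are mine) confirms the two averages:

```python
# A quick numerical check of the average-reward comparison (a sketch; names and the
# tick count are mine). From office 0, policy 1 eventually sits in office 3 and collects
# 10 every tick, while policy 0 cycles through all four offices.

REWARD = {0: 0, 1: 0, 2: 0, 3: 10}
NEXT = {"+": lambda s: (s + 1) % 4, "-": lambda s: (s - 1) % 4, "stay": lambda s: s}

def average_reward(policy, start=0, ticks=10_000):
    state, total = start, 0
    for _ in range(ticks):
        state = NEXT[policy[state]](state)
        total += REWARD[state]
    return total / ticks

policy0 = {0: "+", 1: "+", 2: "+", 3: "+"}
policy1 = {0: "+", 1: "+", 2: "+", 3: "stay"}
print(average_reward(policy0))   # 2.5
print(average_reward(policy1))   # 9.998 -> approaches 10
```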

11 Discounted cumulative reward Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards. The discount rate γ is a number between 0 and 1 used to discount future rewards. The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of γ^n times the reward associated with the state reached after the n-th tick of the clock. POLICY 1: the discounted cumulative reward for state 0 is 2.5. POLICY 0: the discounted cumulative reward for state 0 is 1.33.

12 Discounted cumulative reward (cont’d) Take γ = 0.5. For state 0 with respect to policy 0: 0.5 × 0 + 0.25 × 0 + 0.125 × 10 + 0.0625 × 0 + 0.03125 × 0 + 0.015625 × 0 + 0.0078125 × 10 + … = 1.33 in the limit. For state 0 with respect to policy 1: 0.5 × 0 + 0.25 × 0 + 0.125 × 10 + 0.0625 × 10 + 0.03125 × 10 + 0.015625 × 10 + … = 2.5 in the limit.
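The same limits can be checked numerically by truncating the infinite sums; this is a small sketch with my own names, not code from the text:

```python
# Checking the two discounted sums numerically (a sketch with my own names); the
# infinite series is truncated after enough terms for the limit to be visible.

GAMMA = 0.5
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}
NEXT = {"+": lambda s: (s + 1) % 4, "-": lambda s: (s - 1) % 4, "stay": lambda s: s}

def discounted_return(policy, start=0, horizon=50):
    state, total = start, 0.0
    for n in range(1, horizon + 1):
        state = NEXT[policy[state]](state)
        total += GAMMA ** n * REWARD[state]
    return total

policy0 = {0: "+", 1: "+", 2: "+", 3: "+"}
policy1 = {0: "+", 1: "+", 2: "+", 3: "stay"}
print(round(discounted_return(policy0), 2))   # 1.33
print(round(discounted_return(policy1), 2))   # 2.5
```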

13 Discounted cumulative reward (cont’d) Let j be a state, R(j) be the reward for ending up in state j, π be a fixed policy, π(j) be the action dictated by π in state j, f(j, a) be the next state given that the robot starts in state j and performs action a, and V^π_i(j) be the estimated value of state j with respect to the policy π after the i-th iteration of the algorithm. Using a dynamic programming algorithm, one can obtain a good estimate of V^π, the value function for policy π, as i → ∞.

14 A dynamic programming algorithm to compute values of states for a policy π 1. For each j, set V^π_0(j) to 0. 2. Set i to 0. 3. For each j, set V^π_{i+1}(j) to R(j) + γ V^π_i(f(j, π(j))). 4. Set i to i + 1. 5. If i is equal to the maximum number of iterations, then return V^π_i; otherwise, return to step 3.
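A sketch of this algorithm in Python (function and variable names such as evaluate_policy and f are mine); running it for three updates reproduces the policy 0 values shown on the following slides:

```python
# A sketch of the algorithm on this slide (function and variable names are mine).
# Each pass computes V_{i+1}(j) = R(j) + gamma * V_i(f(j, policy(j))) for every state j.

GAMMA = 0.5
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def f(state, action):
    """Deterministic transition function for the ring of four offices."""
    return {"+": (state + 1) % 4, "-": (state - 1) % 4, "stay": state}[action]

def evaluate_policy(policy, iterations):
    V = {j: 0.0 for j in REWARD}                                   # step 1: V_0(j) = 0
    for _ in range(iterations):                                    # steps 2-5
        V = {j: REWARD[j] + GAMMA * V[f(j, policy[j])] for j in REWARD}
    return V

policy0 = {0: "+", 1: "+", 2: "+", 3: "+"}
print(evaluate_policy(policy0, 3))
# {0: 0.0, 1: 2.5, 2: 5.0, 3: 10.0} -- cf. the "iteration 2" values for policy 0 below
```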

15 Values of states for policy 0 initialize: V(0) = 0, V(1) = 0, V(2) = 0, V(3) = 0. iteration 0: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0; For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0; For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10. (Iteration 0 essentially initializes the values of the states to their immediate rewards.)

16 Values of states for policy 0 (cont’d) iteration 0: V(0) = 0, V(1) = 0, V(2) = 0, V(3) = 10. iteration 1: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10. iteration 2: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10.

17 Values of states for policy 0 (cont’d) iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 5, V(3) = 10. iteration 4: For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25; For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(0) = 10 + 0.5 × 0 = 10. iteration 5: For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25; For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625.

18 Values of states for policy 1 initialize: V(0) = 0, V(1) = 0, V(2) = 0, V(3) = 0. iteration 0: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0; For office 2: R(2) + γ V(3) = 0 + 0.5 × 0 = 0; For office 3: R(3) + γ V(3) = 10 + 0.5 × 0 = 10.

19 Values of states for policy 1 (cont’d) iteration 0: V(0) = 0, V(1) = 0, V(2) = 0, V(3) = 10. iteration 1: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 0 = 0; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(3) = 10 + 0.5 × 10 = 15. iteration 2: For office 0: R(0) + γ V(1) = 0 + 0.5 × 0 = 0; For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5; For office 2: R(2) + γ V(3) = 0 + 0.5 × 15 = 7.5; For office 3: R(3) + γ V(3) = 10 + 0.5 × 15 = 17.5.

20 Values of states for policy 1 (cont’d) iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 7.5, V(3) = 17.5. iteration 4: For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25; For office 1: R(1) + γ V(2) = 0 + 0.5 × 7.5 = 3.75; For office 2: R(2) + γ V(3) = 0 + 0.5 × 17.5 = 8.75; For office 3: R(3) + γ V(3) = 10 + 0.5 × 17.5 = 18.75. iteration 5: For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875; For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375; For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375; For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375.

21 Compare policies Policy 0 after iteration 5: For office 0: R(0) + γ V(1) = 0 + 0.5 × 2.5 = 1.25; For office 1: R(1) + γ V(2) = 0 + 0.5 × 5 = 2.5; For office 2: R(2) + γ V(3) = 0 + 0.5 × 10 = 5; For office 3: R(3) + γ V(0) = 10 + 0.5 × 1.25 = 10.625. Policy 1 after iteration 5: For office 0: R(0) + γ V(1) = 0 + 0.5 × 3.75 = 1.875; For office 1: R(1) + γ V(2) = 0 + 0.5 × 8.75 = 4.375; For office 2: R(2) + γ V(3) = 0 + 0.5 × 18.75 = 9.375; For office 3: R(3) + γ V(3) = 10 + 0.5 × 18.75 = 19.375. Policy 1 is better because every state has a higher value than under policy 0.
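Running the same kind of evaluation for many more iterations (a small self-contained sketch; the encoding and names are mine) shows the comparison in the limit: policy 0 converges to roughly V = (1.33, 2.67, 5.33, 10.67) and policy 1 to roughly V = (2.5, 5, 10, 20), so policy 1 has a strictly higher value at every office.

```python
# Running the evaluation to (near) convergence for both policies (a self-contained
# sketch under the same assumed encoding): every office ends up with a strictly
# higher value under policy 1.

GAMMA = 0.5
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}
NEXT = {"+": lambda s: (s + 1) % 4, "stay": lambda s: s}

def evaluate(policy, iterations=60):
    V = {j: 0.0 for j in REWARD}
    for _ in range(iterations):
        V = {j: REWARD[j] + GAMMA * V[NEXT[policy[j]](j)] for j in REWARD}
    return V

V0 = evaluate({0: "+", 1: "+", 2: "+", 3: "+"})       # policy 0 -> ~(1.33, 2.67, 5.33, 10.67)
V1 = evaluate({0: "+", 1: "+", 2: "+", 3: "stay"})    # policy 1 -> ~(2.5, 5, 10, 20)
print(all(V1[j] > V0[j] for j in REWARD))             # True
```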

22 Temporal credit assignment problem This is the problem of assigning credit or blame to the individual actions in a sequence of actions when feedback is available only at the end of the sequence. When you lose a game of chess or checkers, the blame for your loss cannot necessarily be attributed to the last move you made, or even the next-to-last move. Dynamic programming solves the temporal credit assignment problem by propagating rewards backwards to earlier states, and hence to actions earlier in the sequence of actions determined by a policy.

23 Computing an optimal policy Given a method for estimating the value of states with respect to a fixed policy, it is possible to find an optimal policy. We would like to maximize the discounted cumulative reward. Policy iteration [Howard, 1960] is an algorithm that uses the value-computation algorithm above as a subroutine.

24 Policy iteration algorithm 1. Let π_0 be an arbitrary policy. 2. Set i to 0. 3. Compute V^π_i(j) for each j. 4. Compute a new policy π_{i+1} so that π_{i+1}(j) is the action a maximizing R(j) + γ V^π_i(f(j, a)). 5. If π_{i+1} = π_i, then return π_i; otherwise, set i to i + 1 and go to step 3.
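A sketch of this procedure for the robot domain (my own encoding again; the greedy step maximizes R(j) + γ V(f(j, a)) over the three actions). Note that, under this encoding of the ring, the improvement step discovers a policy even better than policy 1: it sends office 0 straight to office 3 via the - action.

```python
# A sketch of policy iteration for the ring domain (encoding and names are mine).
# Step 4's greedy improvement picks the action a maximizing R(j) + gamma * V(f(j, a)).

GAMMA = 0.5
STATES = [0, 1, 2, 3]
ACTIONS = ["+", "-", "stay"]
REWARD = {0: 0, 1: 0, 2: 0, 3: 10}

def f(j, a):
    return {"+": (j + 1) % 4, "-": (j - 1) % 4, "stay": j}[a]

def evaluate(policy, iterations=60):
    V = {j: 0.0 for j in STATES}
    for _ in range(iterations):
        V = {j: REWARD[j] + GAMMA * V[f(j, policy[j])] for j in STATES}
    return V

def policy_iteration():
    policy = {j: "+" for j in STATES}            # step 1: an arbitrary initial policy
    while True:
        V = evaluate(policy)                     # step 3: evaluate the current policy
        improved = {j: max(ACTIONS, key=lambda a: REWARD[j] + GAMMA * V[f(j, a)])
                    for j in STATES}             # step 4: greedy one-step improvement
        if improved == policy:                   # step 5: stop when the policy is stable
            return policy
        policy = improved

print(policy_iteration())
# {0: '-', 1: '+', 2: '+', 3: 'stay'}: better even than policy 1, since office 0 goes
# straight to office 3 via "-" (office 1 is a tie between "+" and "-"; max keeps "+").
```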

25 Policy iteration algorithm (cont’d) A policy π is said to be optimal if there is no other policy π′ and state j such that V^π′(j) > V^π(j) while V^π′(k) ≥ V^π(k) for all other states k. The policy iteration algorithm is guaranteed to terminate in a finite number of steps with an optimal policy.

26 Comments on reinforcement learning A general model in which an agent can learn to function in dynamic environments. The agent can learn while interacting with the environment. No prior knowledge is assumed beyond the (probabilistic) state transitions. Can be generalized to stochastic domains (an action might have several different probabilistic consequences, i.e., the state-transition function is not deterministic). Can also be generalized to domains where the reward function is not known.

27 Famous example: TD-Gammon (Tesauro, 1995) Learns to play backgammon. Immediate reward: +100 if win, -100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself (several weeks). Now approximately equal to the best human players (won the World Cup of Backgammon in 1992; among the top 3 since 1995). Predecessor: NeuroGammon [Tesauro and Sejnowski, 1989] learned from examples of labelled moves (very tedious for a human expert).

28 Other examples Robot learning to dock on a battery charger. Pole balancing. Elevator dispatching [Crites and Barto, 1995]: better than the industry standard. Inventory management [Van Roy et al.]: 10-15% improvement over industry standards. Job-shop scheduling for NASA space missions [Zhang and Dietterich, 1997]. Dynamic channel assignment in cellular phones [Singh and Bertsekas, 1994]. Robotic soccer.

29 Common characteristics Delayed reward; opportunity for active exploration; the state may be only partially observable; possible need to learn multiple tasks with the same sensors/effectors; there may not be an adequate teacher.