Reinforcement Learning (RL)
CS 760 – Machine Learning (UW-Madison), RL Lecture
© Jude Shavlik 2006, David Page 2007


Slide 1: Reinforcement Learning (RL)
Consider an "agent" embedded in an environment.
The task of the agent is to repeat forever:
1) sense the world
2) reason
3) choose an action to perform

Slide 2: Definition of RL
Assume the world (i.e., the environment) periodically provides rewards or punishments ("reinforcements").
Based on the reinforcements received, learn how to better choose actions.

Slide 3: Sequential Decision Problems (courtesy of A. G. Barto, April 2000)
Decisions are made in stages.
The outcome of each decision is not fully predictable, but can be observed before the next decision is made.
The objective is to maximize a numerical measure of total reward (or, equivalently, to minimize a measure of total cost).
Decisions cannot be viewed in isolation: one must balance the desire for immediate reward against the possibility of high future reward.

Slide 4: Reinforcement Learning vs. Supervised Learning
How would we use SL to train an agent in an environment? Show it the action to choose in a sample of world states ("I/O pairs").
RL requires much less of the teacher: the teacher only sets up a "reward structure", and the learner "works out the details", i.e., writes a program that maximizes the rewards received.

Slide 5: Embedded Learning Systems: Formalization
S_E = the set of states of the world, e.g., an N-dimensional vector (the "sensors")
A_E = the set of possible actions an agent can perform (the "effectors")
W = the world
R = the immediate reward structure
W and R constitute the environment; both can be probabilistic functions.

Slide 6: Embedded Learning Systems: Formalization (cont.)
W: S_E × A_E → S_E
  The world maps a state and an action to a new state.
R: S_E × A_E → reals
  Provides a reward (a number) as a function of state and action (as in the textbook). Can equivalently be formalized as a function of the (next) state alone.
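
To make the formalization concrete, here is a minimal sketch of W and R as lookup tables behind a tiny environment class. This is illustrative code, not from the lecture; the class and method names (DeterministicWorld, step) are assumptions, and only the deterministic case is handled.

```python
class DeterministicWorld:
    """A deterministic environment: W maps (state, action) -> next state,
    R maps (state, action) -> immediate reward."""

    def __init__(self, transitions, rewards):
        self.transitions = transitions   # dict: (state, action) -> next state (the W function)
        self.rewards = rewards           # dict: (state, action) -> reward     (the R function)

    def step(self, state, action):
        """Apply W and R: return (next_state, immediate_reward)."""
        next_state = self.transitions[(state, action)]
        return next_state, self.rewards[(state, action)]
```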

Slide 7: A Graphical View of RL
[Diagram: the agent sends an action to the real world, W; the world returns sensory information plus R, a scalar reward, which serves as an indirect teacher.]
Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions.
For now, assume deterministic problems.

Slide 8: Common Confusion
The state need not be solely the current sensor readings.
Markov assumption: the value of a state is independent of the path taken to reach that state.
The agent can have memory of the past; one can always create a Markovian task by remembering the entire past history.

Slide 9: Need for Memory: Simple Example
"Out of sight, but not out of mind."
[Diagram: at T = 1 and T = 2 the learning agent and an opponent are shown on opposite sides of a wall; the opponent moves out of the agent's view.]
It seems reasonable to remember an opponent seen recently.

Slide 10: State vs. Current Sensor Readings
Remember: the state is what is in one's head (past memories, etc.), not ONLY what one currently sees/hears/smells/etc.

Slide 11: Policies
The agent needs to learn a policy π_E: S_E → A_E.
Given a world state in S_E, which action in A_E should be chosen? The policy function π_E answers this.
Remember: the agent's task is to maximize the total reward received during its lifetime.

Slide 12: Policies (cont.)
To construct π_E, we will assign a utility U (a number) to each state:

  U(s) = Σ_{t=1..∞} γ^{t-1} R(s, π_E, t)

- γ is a positive constant < 1 (the discount factor)
- R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
- Note: future rewards are discounted by γ^{t-1}
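
As a small illustration of the discounted sum (not from the slides), the helper below evaluates a finite prefix of a reward stream; the function name and the truncation to a finite horizon are assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * r_t over t = 1, 2, ..., len(rewards)."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Rewards 0, 0, 3 received at t = 1, 2, 3 with gamma = 2/3:
# 0 + (2/3)*0 + (2/3)**2 * 3 = 4/3
print(discounted_return([0, 0, 3], gamma=2/3))   # 1.333...
```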

Slide 13: The Action-Value Function
We want to choose the "best" action in the current state, so pick the one that leads to the best next state (and include any immediate reward):

  Q(s, a) = R(s, a) + γ U(W(s, a))

Here R(s, a) is the immediate reward received for going to state W(s, a), and γ U(W(s, a)) is the future reward from further actions (discounted due to the 1-step delay).

Slide 14: The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy: choose the a that maximizes Q, i.e.

  a = argmax_{a'} Q(s, a')

Slide 15: Q vs. U, Visually
[Diagram: a graph of states connected by action arcs. Utilities U(1) ... U(6) are attached to the states; values Q(1, i), Q(1, ii), Q(1, iii) are attached to the arcs leaving state 1.]
Key: U's are "stored" on states; Q's are "stored" on arcs.

Slide 16: Q-Learning (Watkins PhD, 1989)
Let Q_t be our current estimate of the optimal Q.
Our current policy is the greedy one: π_t(s) = argmax_a Q_t(s, a).
Our current utility-function estimate is U_t(s) = max_a Q_t(s, a); hence the U table is embedded in the Q table and we don't need to store both.
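
In code, both the greedy policy and the utility estimate are read straight off the Q table; a minimal sketch, assuming Q is a dictionary keyed by (state, action) with unseen entries treated as 0.

```python
def greedy_action(Q, state, actions):
    """pi_t(s) = argmax_a Q_t(s, a)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def utility(Q, state, actions):
    """U_t(s) = max_a Q_t(s, a): the U table is implicit in the Q table."""
    return max(Q.get((state, a), 0.0) for a in actions)
```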

Slide 17: Q-Learning (cont.)
Assume we are in state S_t.
"Run the program" (i.e., follow the current policy) for a while (n steps).
Determine the actual reward and compare it to the predicted reward.
Adjust the prediction to reduce the error.

Slide 18: How Many Actions Should We Take Before Updating Q?
Why not do so after each action? This is "1-step Q-learning", the most common approach.

Slide 19: Exploration vs. Exploitation
In order to learn about better alternatives, we can't always follow the current policy ("exploitation").
Sometimes we need to try "random" moves ("exploration").

Slide 20: Exploration vs. Exploitation (cont.)
Approaches:
1) Some percentage p of the time, make a random move; p could be decreased as learning proceeds.
2) Pick action a in state s with probability

  Prob(a | s) = e^{Q(s,a)} / Σ_{a'} e^{Q(s,a')}

(exponentiating gets rid of negative values).
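
Both approaches are easy to sketch in code, again assuming a dictionary Q table and a fixed action list; the parameter epsilon plays the role of p above, and the softmax is written without a temperature.

```python
import math
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Approach 1: with probability epsilon take a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def softmax_action(Q, state, actions):
    """Approach 2: pick action a with probability proportional to exp(Q(s, a))."""
    weights = [math.exp(Q.get((state, a), 0.0)) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```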

Slide 21: One-Step Q-Learning Algorithm
0. S ← initial state
1. If random# ≤ P then a ← a random choice, else a ← π_t(S)
2. S_new ← W(S, a);  R_immed ← R(S_new)    [act on the world and get the reward]
3. Q(S, a) ← R_immed + γ max_{a'} Q(S_new, a')
4. S ← S_new; go to 1
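
The whole algorithm is a few lines of code; a minimal sketch for the deterministic case (so the learning rate is 1), reusing the hypothetical DeterministicWorld and epsilon_greedy helpers sketched above and assuming the world defines a transition for every (state, action) pair.

```python
def one_step_q_learning(world, actions, start_state, gamma,
                        epsilon=0.1, n_steps=10000):
    """Tabular one-step Q-learning; returns the learned Q table."""
    Q = {}                                                    # (state, action) -> value
    s = start_state                                           # step 0: S <- initial state
    for _ in range(n_steps):
        a = epsilon_greedy(Q, s, actions, epsilon)            # step 1
        s_new, r_immed = world.step(s, a)                     # step 2: act, get reward
        best_next = max(Q.get((s_new, a2), 0.0) for a2 in actions)
        Q[(s, a)] = r_immed + gamma * best_next               # step 3
        s = s_new                                             # step 4, then repeat
    return Q
```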

Slide 22: A Simple Example (of Q-learning, with updates after each step, i.e., N = 1)
Algorithm: repeat { pick a state and an action } (deterministic world, so α = 1). Let γ = 2/3.
[Diagram: five states S0 (R = 0), S1 (R = 1), S2 (R = -1), S3 (R = 0), S4 (R = 3) connected by arcs; every Q value starts at 0.]

Slide 23: A Simple Example (Step 1): S0 → S2
The agent moves from S0 to S2, so Q(S0 → S2) ← R(S2) + γ max_{a'} Q(S2, a') = -1 + (2/3)·0 = -1. All other Q values remain 0.

Slide 24: A Simple Example (Step 2): S2 → S4
The agent moves from S2 to S4, so Q(S2 → S4) ← R(S4) + γ max_{a'} Q(S4, a') = 3 + (2/3)·0 = 3. The arc into S2 keeps Q = -1, and the remaining Q values are still 0.

Slide 25: A Simple Example (Step ∞)
After repeating such updates indefinitely, the Q values shown on the diagram converge to Q = 1, Q = 3, Q = 0, and Q = 1 on the respective arcs.
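
The transcript does not preserve which arcs the diagram contains, so the code below assumes a graph that is consistent with the rewards and converged values shown: S0 has arcs to S1 and S2, S1 has an arc to S3, S2 has an arc to S4, and an episode restarts at S0 whenever a leaf (S3 or S4) is reached. Under those assumptions, random exploration with the one-step update reproduces the Q values on the slide.

```python
import random

REWARD = {'S0': 0, 'S1': 1, 'S2': -1, 'S3': 0, 'S4': 3}   # reward for entering a state
NEXT = {('S0', 'to_S1'): 'S1', ('S0', 'to_S2'): 'S2',      # assumed transition graph
        ('S1', 'to_S3'): 'S3', ('S2', 'to_S4'): 'S4'}
GAMMA = 2 / 3

Q = {sa: 0.0 for sa in NEXT}                               # all Q values start at 0
for _ in range(1000):                                      # many episodes from S0
    s = 'S0'
    while any(st == s for (st, _) in NEXT):                # until a leaf is reached
        a = random.choice([act for (st, act) in NEXT if st == s])
        s_new = NEXT[(s, a)]
        future = max([Q[sa] for sa in NEXT if sa[0] == s_new], default=0.0)
        Q[(s, a)] = REWARD[s_new] + GAMMA * future         # deterministic world, alpha = 1
        s = s_new

print(Q)   # converges to Q(S0,to_S1)=1, Q(S0,to_S2)=1, Q(S1,to_S3)=0, Q(S2,to_S4)=3
```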

Slide 26: Q-Learning: Implementation Details
Remember, conceptually we are filling in a huge table whose rows are the states S0, S1, S2, ..., Sn and whose columns are the actions a, b, c, ..., z; for example, Q(S2, c) is the entry in row S2, column c.
Tables are a very verbose representation of a function.
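
In practice the "huge table" is usually stored sparsely, e.g., as a dictionary that defaults to 0 for cells that have never been visited; a quick sketch:

```python
from collections import defaultdict

Q = defaultdict(float)        # conceptual table: every (state, action) cell defaults to 0.0
Q[('S2', 'c')] = 0.5          # fill in one cell of the huge conceptual table
print(len(Q))                 # 1 -- only the cells actually touched are stored
print(Q[('S0', 'a')])         # 0.0 -- any other cell just reads as the default value
```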

Slide 27: Q-Learning: Convergence Proof
Applies to Q tables and deterministic, Markovian worlds. Initialize the Q values to 0 or to random finite values.
Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then for all s, a the approximate Q table Q̂ converges to the true Q table Q:

  Q̂_t(s, a) → Q(s, a) as t → ∞

Slide 28: Q-Learning Convergence Proof (cont.)
Consider the maximum error in the approximate Q table at step t:

  Δ_t = max_{s,a} | Q̂_t(s, a) − Q(s, a) |

The true Q values are finite since |r| ≤ C, so max_{s,a} |Q(s, a)| is bounded (by C / (1 − γ)).
Since the initial Q̂ values are finite as well, Δ_0 is finite, i.e., the initial maximum error is finite.

Slide 29: Q-Learning Convergence Proof (cont.)
Let s' be the state that results from doing action a in state s (s is the current state, s' the next state). Consider what happens when we visit s and do a at step t + 1:

  | Q̂_{t+1}(s, a) − Q(s, a) |
    = | (r + γ max_{a'} Q̂_t(s', a')) − (r + γ max_{a''} Q(s', a'')) |

where the first term follows from the Q-learning rule (one step) and the second from the definition of Q (notice that the best a in s' might be different for Q̂ and Q).

Slide 30: Q-Learning Convergence Proof (cont.)
    = γ | max_{a'} Q̂_t(s', a') − max_{a''} Q(s', a'') |          (by algebra: the r terms cancel)
    ≤ γ max_{a'''} | Q̂_t(s', a''') − Q(s', a''') |               (trickiest step; can prove by contradiction)
    ≤ γ max_{s'', a'''} | Q̂_t(s'', a''') − Q(s'', a''') |         (the max at s' ≤ the max over any state)
    = γ Δ_t                                                        (plugging in the definition of Δ_t)

Slide 31: Q-Learning Convergence Proof (cont.)
Hence, every time after t that we visit an (s, a) pair, its Q value differs from the correct answer by no more than γ Δ_t.
Let T_0 = t_0 (i.e., the start) and let T_N be the first time since T_{N-1} at which every (s, a) pair has been visited at least once.
Call the time between T_{N-1} and T_N a complete interval.
Clearly, Δ_{T_N} ≤ γ Δ_{T_{N-1}}.

Slide 32: Q-Learning Convergence Proof (concluded)
That is, in every complete interval, Δ_t is reduced by at least a factor of γ.
Since we assumed every (s, a) pair is visited infinitely often, we will have an infinite number of complete intervals.
Hence, lim_{t → ∞} Δ_t = 0.
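
The bound can also be checked empirically. The sketch below (an illustration, not part of the lecture) builds a small random deterministic world, computes the true Q table by iterating the Bellman backup, and then runs the one-step update in sweeps over all (state, action) pairs, so each sweep is one "complete interval"; the printed max error shrinks by at least a factor of γ per sweep.

```python
import random

random.seed(0)
N_STATES, N_ACTIONS, GAMMA = 6, 3, 0.8
W = {(s, a): random.randrange(N_STATES)          # deterministic transitions
     for s in range(N_STATES) for a in range(N_ACTIONS)}
R = {(s, a): random.uniform(-1, 1)               # bounded rewards, |r| <= 1
     for s in range(N_STATES) for a in range(N_ACTIONS)}

def backup(Q, s, a):
    """One application of Q(s,a) = R(s,a) + gamma * max_a' Q(W(s,a), a')."""
    return R[(s, a)] + GAMMA * max(Q[(W[(s, a)], a2)] for a2 in range(N_ACTIONS))

# "True" Q table: iterate the backup to (near) convergence.
Q_true = {sa: 0.0 for sa in W}
for _ in range(500):
    Q_true = {sa: backup(Q_true, *sa) for sa in W}

# Q-learning, where one complete interval = one sweep over every (state, action) pair.
Q = {sa: 0.0 for sa in W}
for sweep in range(10):
    delta = max(abs(Q[sa] - Q_true[sa]) for sa in W)
    print(f"after {sweep} complete intervals: max error = {delta:.5f}")
    for sa in W:
        Q[sa] = backup(Q, *sa)
```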

Slide 33: Representing Q Functions More Compactly
We can use some other function representation (e.g., a neural net) to compactly encode this big table.
[Diagram: a neural network. Each input unit encodes a property of the state S (e.g., a sensor value); the outputs are Q(S, a), Q(S, b), ..., Q(S, z), i.e., one output per action, with the second argument held constant per output.]
Or one could have one net for each possible action.
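
A minimal numpy sketch of such a net (one hidden layer, one output per action); the layer sizes are chosen to match the counting on the next slide, and the weights here are random, i.e., training the net against the one-step Q-learning targets is not shown.

```python
import numpy as np

n_features, n_hidden, n_actions = 100, 100, 10
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_features))    # input-to-hidden weights
W2 = rng.normal(scale=0.1, size=(n_actions, n_hidden))     # hidden-to-output weights

def q_values(state_features):
    """Return one Q(S, a) estimate per action for the encoded state S."""
    hidden = np.tanh(W1 @ state_features)
    return W2 @ hidden

state = rng.integers(0, 2, size=n_features).astype(float)  # 100 Boolean sensor values
print(q_values(state))                                      # one Q estimate per action
```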

Slide 34: Q Tables vs. Q Nets
Given 100 Boolean-valued features and 10 possible actions:
Size of the Q table: 10 × 2^100 entries (2^100 possible states, one entry per state-action pair).
Size of a Q net with 100 hidden units: 100 × 100 + 100 × 10 = 11,000 weights (weights between the inputs and the hidden units, plus weights between the hidden units and the outputs).
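
The counts on this slide are just arithmetic; a two-line check (bias terms ignored, as on the slide):

```python
q_table_entries = 10 * 2 ** 100            # 10 actions * 2^100 possible states
q_net_weights = 100 * 100 + 100 * 10       # inputs-to-hidden + hidden-to-outputs
print(f"{q_table_entries:.2e} table entries vs {q_net_weights} network weights")
```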

Slide 35: Why Use a Compact Q-Function?
1. The full Q table may not fit in memory for realistic problems.
2. It can generalize across states, thereby speeding up convergence, i.e., one example "fills" many cells in the Q table.