Presentation transcript: Today's Topics: Reinforcement Learning (RL), Q Learning (CS 540, Shavlik)

1 Today’s Topics Reinforcement Learning (RL) Q learning
Reinforcement Learning (RL)
Q learning
Exploration vs Exploitation
Generalizing Across State
Used in Clinical Trials
Recently Increased Emphasis in IT Companies
Inverse RL

2 Reinforcement Learning vs Supervised Learning
RL requires much less of the teacher. The teacher must set up the 'reward structure'; the learner 'works out the details', ie, it effectively writes a program to maximize the rewards received.

3 Sequential Decision Problems
Courtesy of Andy Barto (pictured)
Decisions are made in stages.
The outcome of each decision is not fully predictable, but it can be observed before the next decision is made.
The objective is to maximize a numerical measure of total reward (or, equivalently, to minimize a measure of total cost).
Decisions cannot be viewed in isolation: we need to balance the desire for immediate reward against the possibility of high future reward.

4 RL Systems: Formalization
SE = the set of states of the world, eg, an N-dimensional vector of 'sensors' (plus memory of past sensations)
AE = the set of possible actions an agent can perform (its 'effectors')
W = the world
R = the immediate reward structure
W and R constitute the environment and can be stochastic functions (usually most states have R = 0; ie, rewards are sparse).

5 Embedded Learning Systems: Formalization (cont.)
W: SE x AE → SE [here the arrow means 'maps to']
The world takes a state and an action and produces a new state.
R: SE → reals
Provides a reward (a number; often 0) as a function of the current state.
Note: we can instead use R: SE x AE → reals (ie, rewards depend on how we ENTER a state).

6 A Graphical View of RL
Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions. We'll assume deterministic problems.
[Diagram: the agent sends an action to the real world, W; the world returns sensory info plus R, a scalar reward that serves as an indirect teacher.]

7 Common Confusion
State need not be solely the current sensor readings.
The Markov Assumption is commonly used in RL: the value of a state is independent of the path taken to reach that state.
But we can store memory of the past in the current state; we can always create a Markovian task by remembering the entire past history.

8 Need for Memory: Simple Example
'Out of sight, but not out of mind.'
[Diagram: at Time=1 the learning agent can see an opponent; at Time=2 the opponent has moved behind a wall and is out of view.]
It seems reasonable to remember that the opponent was recently seen.

9 State vs. Current Sensor Readings
Remember: state is what is in one's head (past memories, etc), not ONLY what one currently sees/hears/smells/etc.

10 Policies
The agent needs to learn a policy πE : ŜE → AE.
Given a world state in ŜE, which action in AE should be chosen? ŜE is our learner's approximation to the true SE; πE is the policy function.
Remember: the agent's task is to maximize the total reward received during its lifetime.

11 True World States vs. the Learner’s Representation of the World State
From here forward, S will be our learner's approximation of the true world state.
Exceptions: W: S x A → S and R: S → reals. These are our notations for how the true world behaves when we act upon it.
You can think of W and R as taking the learner's representation of the world state as an argument and internally converting it to the 'true' world state(s).

12 Policies (cont.)
To construct πE, we assign a utility U (a number) to each state:
U(s) = Σ over t = 1..∞ of γ^(t-1) R(s, πE, t)
γ is a positive constant ≤ 1.
R(s, πE, t) is the reward received at time t, assuming the agent follows policy πE and starts in state s at t = 0.
Note: future rewards are discounted by γ^(t-1).
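As a quick illustration of the discounted sum above, here is a minimal Python sketch (my own, not from the slides); the function name and the example reward sequence are made up.

    def discounted_return(rewards, gamma):
        """Sum of rewards, where the reward at step t (t = 1, 2, ...) is weighted by gamma^(t-1)."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Example: rewards of 0, 0, 3 with gamma = 2/3 give 0 + 0 + (2/3)^2 * 3 ≈ 1.33
    print(discounted_return([0, 0, 3], gamma=2/3))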

13 Why have a Decay on Rewards?
Getting 'money' in the future is worth less than money right now:
Inflation
More time to enjoy what it buys
Risk of death before collecting
Discounting also allows convergence proofs for the functions we're learning.

14 The Action-Value Function
We want to choose the 'best' action in the current state, so pick the one that leads to the best next state (and include any immediate reward).
Let Q(s, a) = R(W(s, a)) + γ U(W(s, a))
The first term is the immediate reward received for going to state W(s, a) [alternatively, R(s, a)]; the second term is the future reward from further actions (discounted due to the 1-step delay).

15 The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy: choose the action a where a = argmax over a' of Q(s, a').
Note: x = argmax_x f(x) sets x to the value that leads to the maximum value of f(x).
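A tiny sketch of this argmax action choice (my own; it assumes a Q table stored as a Python dict keyed by (state, action), which is not the lecture's code):

    def greedy_action(q_table, state, legal_actions):
        """Return the legal action with the highest Q(state, action); unseen pairs count as 0."""
        return max(legal_actions, key=lambda a: q_table.get((state, a), 0.0))

    q = {('s0', 'up'): 1.0, ('s0', 'down'): 0.5}
    print(greedy_action(q, 's0', ['up', 'down']))   # 'up'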

16 Q vs. U, Visually
[Diagram: a graph whose nodes are states and whose arcs are actions. U values such as U(1), U(3), U(4), U(5), U(6) label the states; Q values such as Q(1,i), Q(1,ii), Q(1,iii) label the arcs leaving state 1.]
U's are 'stored' on states; Q's are 'stored' on arcs.

17 Q's vs. U's
Assume we're in state S. Which action do we choose?
U's (model-based): need a 'next state' function to generate all possible next states (eg, chess); choose the next state with the highest U value.
Q's (model-free, though one can also do model-based Q learning): need only know which actions are legal (eg, the web); choose the arc with the highest Q value.

18 Q-Learning (Watkins PhD, 1989)
Let Qt be our current estimate of the optimal Q.
Our current policy is πt(s) = argmax over a of Qt(s, a).
Our current utility-function estimate is Ut(s) = max over a of Qt(s, a); hence the U table is embedded in the Q table and we don't need to store both.

19 Q-Learning (cont.)
Assume we are in state St.
'Run the program'* for a while (N steps).
Determine the actual reward and compare it to the predicted reward.
Adjust the prediction to reduce the error.
* Ie, follow the current policy.

20 Updating Qt
Let the N-step estimate of future rewards be
r_(t+1) + γ r_(t+2) + ... + γ^(N-1) r_(t+N) + γ^N max over a of Qt(S_(t+N), a)
The summation term is the actual (discounted) reward received during the N time steps; the final term is the estimate of the future reward if we continued to t = ∞.

21 Changing the Q Function (ie, learn a better approx.)
Q_(t+N)(St, at) ← Qt(St, at) + α [N-step estimate − Qt(St, at)]
Here Qt(St, at) is the old estimate, Q_(t+N)(St, at) is the new estimate (at time t + N), the bracketed term is the error, and α is the learning rate (for deterministic worlds, set α = 1).
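A hedged Python sketch of this update; the helper name, the dict-based Q table, and the argument names are my own assumptions rather than the lecture's code.

    def n_step_update(q, s, a, rewards, s_after_n, legal_actions, gamma, alpha):
        """Move Q(s, a) toward the N-step estimate:
           sum_i gamma^(i-1) * r_i  +  gamma^N * max_a' Q(s_after_n, a')."""
        n = len(rewards)
        estimate = sum(gamma ** i * r for i, r in enumerate(rewards))
        estimate += gamma ** n * max(q.get((s_after_n, a2), 0.0) for a2 in legal_actions)
        error = estimate - q.get((s, a), 0.0)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * error   # alpha = 1 in deterministic worlds
        return q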

22 Pictorially (here rewards are on arcs, rather than states)
[Diagram: the actual moves made (shown in red) from S1 through SN, collecting rewards r1, r2, r3, ..., with the potential next states branching off at each step; the estimate is the discounted sum of these rewards plus an estimate of the remainder of the infinite sum.]

23 How Many Actions Should We Take Before Updating Q?
Why not update after each action? That is one-step Q learning, the most common approach.

24 Exploration vs. Exploitation
In order to learn about better alternatives, we can't always follow the current policy ('exploitation'). Sometimes we need to try random moves ('exploration').

25 Exploration vs. Exploitation (cont)
Approaches:
1) p percent of the time, make a random move (p could, eg, decrease as learning proceeds).
2) Prob(picking action A in state S) = e^(Q(S,A)) / Σ over a of e^(Q(S,a)).
Exponentiating gets rid of negative Q values.
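Below is a small sketch of both exploration schemes; the function and parameter names (epsilon, temperature) are mine, and the temperature parameter is an optional generalization of the exponentiation idea, not something stated on the slide.

    import math, random

    def epsilon_greedy(q, state, actions, epsilon):
        """With probability epsilon pick a random action, else the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q.get((state, a), 0.0))

    def softmax_action(q, state, actions, temperature=1.0):
        """Pick action A with probability proportional to exp(Q(S, A) / temperature)."""
        weights = [math.exp(q.get((state, a), 0.0) / temperature) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]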

26 One-Step Q-Learning Algo
0. S ← initial state
1. If random # ≤ P then a ← random choice // occasionally 'explore'
   else a ← πt(S) // else 'exploit'
2. Snew ← W(S, a) // act on the world ...
   Rimmed ← R(Snew) // ... and get the reward
   Error ← Rimmed + γ U(Snew) − Q(S, a) // use Q to compute U
   Q(S, a) ← Q(S, a) + α Error // should also decay α
   S ← Snew
   Go to 1
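A runnable tabular sketch of the one-step algorithm above. The world is passed in as two functions, W(s, a) and R(s), mirroring the slides' notation; the remaining names, default constants, and episode handling are my own assumptions.

    import random

    def q_learning(start, W, R, legal_actions, terminal,
                   gamma=0.9, alpha=1.0, epsilon=0.1, episodes=1000):
        q = {}                                   # Q table, default 0
        for _ in range(episodes):
            s = start
            while s not in terminal:
                acts = legal_actions(s)
                if random.random() < epsilon:    # occasionally 'explore'
                    a = random.choice(acts)
                else:                            # else 'exploit'
                    a = max(acts, key=lambda x: q.get((s, x), 0.0))
                s_new = W(s, a)                  # act on the world ...
                r = R(s_new)                     # ... and get the reward
                u_new = 0.0 if s_new in terminal else \
                    max(q.get((s_new, x), 0.0) for x in legal_actions(s_new))
                error = r + gamma * u_new - q.get((s, a), 0.0)
                q[(s, a)] = q.get((s, a), 0.0) + alpha * error
                s = s_new
        return q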

27 Visualizing Q-Learning
(1-step 'lookahead')
[Diagram: taking action a from state I (and getting reward R) leads to state J, which has actions a through z available.]
The estimate Q(I, a) should equal R + γ max over x of Q(J, x); we train the ML system to learn a consistent set of Q values.

28 Bellman Optimality Equation
(from 1957, though stated for the U function back then)
IF for all s and a: Q(s, a) = R(SN) + γ max over a' of Q(SN, a'), where SN = W(s, a), ie, the next state,
THEN the resulting policy, π(s) = argmax over a of Q(s, a), is optimal; ie, it leads to the highest discounted total reward (also, any optimal policy satisfies the Bellman equation).

29 A Simple Example (of Q-learning, with updates after each step, ie N = 1)
Let γ = 2/3 (deterministic world, so α = 1).
[Diagram: a small state graph with start state S0 (R = 0), an upper path through S1 (R = 1) to S3 (R = 0), and a lower path through S2 (R = -1) to S4 (R = 3). All Q values on the arcs start at 0.]

30 A Simple Example (Step 1): S0 → S2
[Moving from S0 to S2 earns the immediate reward R(S2) = -1, and max Q(S2, ·) is still 0, so Q(S0, →S2) is updated to -1 + γ·0 = -1; all other Q values remain 0.]

31 A Simple Example (Step 2): S2 → S4
[Moving from S2 to S4 earns R(S4) = 3, and there is no further Q value to add beyond S4, so Q(S2, →S4) is updated to 3; Q(S0, →S2) stays at -1.]

32 A Simple Example (Step i): S0 → S2
Assume we get to the end of the game and are 'magically' restarted in S0.
[At this point the Q table still has Q(S0, →S2) = -1 and Q(S2, →S4) = 3; the other arcs are 0.]

33 A Simple Example (Step i+1): S0 → S2
[Moving from S0 to S2 again now gives Q(S0, →S2) = R(S2) + γ max Q(S2, ·) = -1 + (2/3)·3 = 1.]

34 A Simple Example (Step ∞), ie, the Bellman Optimal
What would the final Q values be if we explored + exploited for a long time, always returning to S0 after 5 actions?
[All Q values on the graph are shown as '?'.]

35 A Simple Example (Step ∞)
With γ = 2/3, the first move of the upper path (through S1) and the first move of the lower path (through S2) both converge to Q = 1, and Q(S2, →S4) = 3: the two paths out of S0 tie.
What would happen if γ > 2/3? The lower path is better.
What would happen if γ < 2/3? The upper path is better.
This shows the need for EXPLORATION, since the first-ever action out of S0 may or may not be the optimal one.
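For concreteness, here is a hedged sketch that runs repeated Bellman backups (Q-value iteration) on my reconstruction of the example graph: an upper path S0 to S1 (R=1) to S3 (R=0) and a lower path S0 to S2 (R=-1) to S4 (R=3). The slide's exact arc set may differ, so treat this as illustrative; with γ = 2/3 both first moves out of S0 come out to Q = 1, matching the slide.

    # Assumed reconstruction of the example; (state, action) -> next state
    GAMMA = 2 / 3
    ARCS = {
        ('S0', 'up'):   'S1',
        ('S0', 'down'): 'S2',
        ('S1', 'go'):   'S3',
        ('S2', 'go'):   'S4',
    }
    REWARD = {'S0': 0, 'S1': 1, 'S2': -1, 'S3': 0, 'S4': 3}   # reward for entering a state

    q = {sa: 0.0 for sa in ARCS}
    for _ in range(50):                          # enough sweeps to converge here
        for (s, a), s_next in ARCS.items():
            future = [q[sa] for sa in ARCS if sa[0] == s_next]
            q[(s, a)] = REWARD[s_next] + GAMMA * (max(future) if future else 0.0)

    print(q)   # both arcs out of S0 end up with Q = 1; Q(S2, go) = 3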

36 An "On Your Own" RL HW
Consider the deterministic reinforcement environment drawn below. Let γ = 0.5. Immediate rewards are indicated inside the nodes. Once the agent reaches the 'end' state, the current episode ends and the agent is magically transported to the 'start' state.
(a) A one-step, Q-table learner follows the path Start → B → C → End. On the graph below, show the Q values that have changed, and show your work. Assume that for all legal actions (ie, for all the arcs on the graph), the initial values in the Q table are 4, as shown above (feel free to copy the above 4's below, but somehow highlight the changed values).
[Diagram, shown twice: a graph with nodes Start (r=0), A (r=2), B (r=5), C (r=3), and End (r=5); every arc's initial Q value is 4.]

37 An "On Your Own" RL HW
(b) Starting with the Q table you produced in Part (a), again follow the path Start → B → C → End and show the Q values below that have changed from Part (a). Show your work.
(c) What would the final Q values be in the limit of trying all possible arcs 'infinitely' often? Ie, what is the Bellman-optimal Q table? Explain your answer.
(d) What is the optimal path between Start and End? Explain.
[The same Start/A/B/C/End diagram is repeated twice for writing in answers.]

38 Q-Learning: The Need to ‘Generalize Across State’
Remember, conceptually we are filling in a huge table whose columns are states (S0, S1, S2, ..., Sn) and whose rows are actions (a, b, c, ..., z); each cell holds one entry such as Q(S2, c).
Tables are a very verbose representation of a function.

39 Representing Q Functions More Compactly
We can use some other function representation (eg, a neural net) to compactly encode this big table.
[Diagram: a network whose inputs are an encoding of the state S, where each input unit encodes a property of the state (eg, a sensor value), and whose outputs are Q(S, a), Q(S, b), ..., Q(S, z); the second argument of Q is a constant per output.]
Or we could have one net for each possible action.

40 Q Tables vs Q Nets
Given: 100 Boolean-valued features and 10 possible actions.
Size of Q table: 10 × 2^100 entries.
Size of Q net (100 HUs): 100 × 100 weights between the inputs and the HUs, plus 100 × 10 weights between the HUs and the outputs Q(S,0), Q(S,1), ..., Q(S,9), for a total of 11,000 weights.
Similar idea as Full Joint Probability Tables vs Bayes Nets (called 'factored' representations).

41 Why Use a Compact Q-Function?
The full Q table may not fit in memory for realistic problems.
A compact function can generalize across states, thereby speeding up convergence; ie, one example 'fills' many cells in the Q table.
Notes: when generalizing across states, we cannot use α = 1, and the convergence proofs only apply to Q tables.

42 Three Forward Props and a BackProp
Aside: we could save some forward props by caching information.
1. Forward prop the encoding of S0 through the ANN to get Q(S0, A) ... Q(S0, Z); choose an action, execute it in the world, and 'read' the new sensors and the reward.
2. Forward prop S1 to get Q(S1, A) ... Q(S1, Z) and estimate U(S1) = max Q(S1, X) over X ∈ actions; use it to calculate the 'teacher's' output for Q(S0, A).
3. Forward prop S0 again, compare Q(S0, A) to the new estimate (assume Q is 'correct' for the other actions), and backprop to reduce the error at Q(S0, A).
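A hedged sketch of the three forward props and one gradient step, using a linear Q approximator instead of the lecture's neural net so it stays short; all shapes, names, and constants are my assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_actions = 8, 4
    W = rng.normal(scale=0.1, size=(n_actions, n_features))   # linear Q: Q(s, .) = W @ s

    def q_values(state):
        return W @ state            # one 'forward prop': Q(s, a) for every action

    gamma, alpha = 0.9, 0.1
    s0 = rng.normal(size=n_features)

    q0 = q_values(s0)               # forward prop 1: choose action in S0
    a = int(np.argmax(q0))
    # ... pretend we executed action a in the world and observed:
    reward, s1 = 1.0, rng.normal(size=n_features)

    u1 = q_values(s1).max()         # forward prop 2: U(S1) = max_x Q(S1, x)
    target = reward + gamma * u1    # the "teacher's" output for Q(S0, a)

    error = target - q_values(s0)[a]    # forward prop 3: current Q(S0, a)
    W[a] += alpha * error * s0          # gradient step on the chosen output only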

43 The Agent World (Rough sketch, implemented in Java [by me], linked to cs540 home page)
[Diagram: a 2-D world containing the RL agent, pushable ice cubes, opponents, and food.]

44 Some (Ancient) Agent World Results
[Plot: mean (discounted) score on the test-set suite vs training-set steps (in thousands, up to 2000K; roughly two weeks of computation 10-20 years ago, on much slower CPUs). Curves compare a Q table, perceptrons trained by supervised learning on 600 examples, a hand-coded policy, and Q nets with 5, 15, 25, and 50 hidden units.]

45 Estimating Values 'In Place'
(see the relevant section of the Sutton + Barto RL textbook)
Let ri be our i-th estimate of some Q value. Note: ri is not the immediate reward Ri; rather, ri = Ri + γ U(next state_i).
Assume we have k + 1 such measurements.

46 Estimating Values (cont.)
The estimate based on k + 1 trials is the average of the k + 1 measurements:
Q_(k+1) = (1 / (k+1)) Σ over i = 1..k+1 of r_i
Pull out the last term:
Q_(k+1) = (1 / (k+1)) [ r_(k+1) + Σ over i = 1..k of r_i ]
Stick in the definition of Qk (since the sum of the first k measurements equals k Qk):
Q_(k+1) = (1 / (k+1)) [ r_(k+1) + k Qk ]

47 ‘In Place’ Estimates (cont.)
Repeating: Q_(k+1) = (1 / (k+1)) [ r_(k+1) + k Qk ]
Add and subtract Qk:
Q_(k+1) = Qk + (1 / (k+1)) [ r_(k+1) − Qk ]
This is the familiar update with the current 'running' average Qk, the latest estimate r_(k+1), and α = 1 / (k+1). Notice that α needs to decay over time.
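A quick numerical check (mine, not from the slides) that the 'in place' running average with α = 1/k reproduces the batch average of the same estimates:

    estimates = [2.0, 5.0, 3.0, 4.0]

    q = 0.0
    for k, r in enumerate(estimates, start=1):
        q = q + (1.0 / k) * (r - q)        # alpha = 1/k decays over time

    print(q, sum(estimates) / len(estimates))   # both print 3.5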

48 Note
The 'running average' analysis is for Q tables.
When 'generalizing across state,' the Q values are coupled together, so we can't simply divide by the number of times an arc was traversed.
Also, even if the world is DETERMINISTIC, we still need to do a running average.

49 Q-Learning Convergences
Only applies to Q tables and deterministic, Markovian worlds.
Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then for all s, a the approximate Q table (Q̂) converges to the true Q table (Q).

50 An RL Video
https://m.youtube.com/watch?v=iqXKQf2BOSE

51 Inverse RL
Inverse RL: learn the reward function of an agent by observing its behavior.
Some early papers:
A. Ng and S. Russell, "Algorithms for inverse reinforcement learning," in ICML, 2000.
P. Abbeel and A. Ng, "Apprenticeship learning via inverse reinforcement learning," in ICML, 2004.

52 Recap: Supervised Learners Helping the RL Learner
Note that Q learning automatically creates I/O pairs for a supervised ML algorithm when 'generalizing across state.'
We can also learn a model of the world (W) and the reward function (R); simulations via learned models reduce the need for 'acting in the physical world.'

53 Challenges in RL Q tables too big, so use function approximation
Function approximation can 'generalize across state' (eg, via ANNs), though the convergence proofs no longer apply.
Hidden state ('perceptual aliasing'): two different states might look the same (eg, due to 'local' sensors); we can use the theory of Partially Observable Markov Decision Problems (POMDPs).
Multi-agent learning: the world is no longer stationary.

54 Could use GAs for RL Task
Another approach is to use GAs to evolve good policies:
Create N 'agents.'
Measure each one's rewards over some time period.
Discard the worst, cross over the best, and do some mutation.
Repeat 'forever' (a model of biology).
Both 'predator' and 'prey' evolve/learn, ie, co-evolution.

55 Summary of Non-GA Reinforcement Learning
Positives:
Requires much less 'teacher feedback.'
An appealing approach to learning to predict and control (eg, robotics, softbots); demo of Google's Q learning.
Solid mathematical foundations: dynamic programming, Markov decision processes, convergence proofs (in the limit).
Core of the solution to the general AI problem?

56 Summary of Non-GA Reinforcement Learning (cont.)
Negatives:
Need to deal with huge state-action spaces (so convergence is very slow).
Hard to design the R function?
Learns a specific environment rather than general concepts (depends on the state representation)?
Dealing with multiple learning agents?
Hard to learn at multiple 'grain sizes' (hierarchical RL).

