Presentation on theme: "Rutgers CS440, Fall 2003 Reinforcement Learning. Reading: Ch. 21, AIMA 2nd Ed." Presentation transcript:

1 Rutgers CS440, Fall 2003 Reinforcement Learning. Reading: Ch. 21, AIMA 2nd Ed.

2 Rutgers CS440, Fall 2003 Outline
–What is RL?
–Methods for RL
Note: only brief overview, no in-depth coverage.

3 Rutgers CS440, Fall 2003 What is Reinforcement Learning (RL)?
Learning so far: learning probabilistic models (BNs) or functions (NNs).
RL: learning what to do and how to do it from feedback (reward / reinforcement).
–Chess playing: learn how to play from won / lost feedback.
–Learning to speak, crawl, …
–Learning user preferences for web searching.
MDP: find the optimal policy using a known model.
–Optimal policy = the policy that maximizes expected total reward.
RL: learn the optimal policy from rewards alone.
–Do not know the environment model.
–Do not know the reward function.
–Only know how well something was done (e.g., won / lost).

4 Rutgers CS440, Fall 2003 Types of RL
MDP setting: actions + states + rewards.
–Passive learning: the policy is fixed; learn the utility of states (+ the rest of the model).
–Active learning: the policy is not fixed; learn utilities as well as the optimal policy.
[Figure: temporal chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, …, with transition model P(S_t | S_{t-1}, A_{t-1}) and rewards R(S_t).]
(A minimal representation sketch follows below.)
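As a concrete illustration of the pieces named above, here is a minimal sketch of how the transition model P(S_t | S_{t-1}, A_{t-1}), the rewards R(S_t), and a fixed policy might be held in plain Python data structures. All state names, actions, probabilities, and reward values below are illustrative assumptions, not taken from the slides.

# Sketch of an MDP's ingredients as plain Python data (illustrative values only).

# Transition model P(s' | s, a): maps (state, action) -> {next_state: probability}.
P = {
    ("s1", "a"): {"s1": 0.3, "s2": 0.7},
    ("s1", "b"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a"): {"s1": 0.4, "s2": 0.6},
    ("s2", "b"): {"s1": 0.1, "s2": 0.9},
}

# Reward function R(s): maps state -> immediate reward.
R = {"s1": -1.0, "s2": 2.0}

# A fixed policy pi(s): maps state -> action (this is what passive RL evaluates).
policy = {"s1": "a", "s2": "b"}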

5 Rutgers CS440, Fall 2003 Passive RL
The policy is known and fixed; need to learn how good it is, plus the environment model.
Learn U(s_t), without knowing P(s_t | s_{t-1}, a_{t-1}) or R(s_t), from observed trials.
Method: conduct trials under the policy, receive a sequence of actions, states, and rewards {(a_t, s_t, r_t)}, and from it compute the model parameters and utilities (a simulation sketch follows below).
[Figure: temporal chain S_0, A_0, R_0 → S_1, A_1, R_1 → S_2, A_2, R_2, …, with P(S_t | S_{t-1}, A_{t-1}) and R(S_t).]
Example trial (sample #1 from the next slide):
a_t: N, A, A, A    s_t: NL, NL, L, L    r_t: -20, 0, 20, 20
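A sketch of how such trials might be generated when a simulator of the environment is available (the agent itself only sees the resulting (a_t, s_t, r_t) sequence). The model numbers, state names, and helper function are illustrative assumptions.

import random

# Environment model hidden from the learner; it only observes the trial output.
P = {("s1", "a"): {"s1": 0.3, "s2": 0.7},
     ("s2", "a"): {"s1": 0.4, "s2": 0.6}}
R = {"s1": -1.0, "s2": 2.0}
policy = {"s1": "a", "s2": "a"}        # the fixed policy being evaluated

def run_trial(start_state, steps=5):
    """Follow the fixed policy and record the observed (a_t, s_t, r_t) triples."""
    s = start_state
    trial = []
    for _ in range(steps):
        a = policy[s]
        next_dist = P[(s, a)]
        trial.append((a, s, R[s]))     # reward is observed in the current state
        s = random.choices(list(next_dist), weights=list(next_dist.values()))[0]
    return trial

print(run_trial("s1"))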

6 Rutgers CS440, Fall 2003 Direct utility estimation
Observe (a_t, s_t, r_t) and estimate U(s_t) from "counts" of observed returns (inductive learning).
Sample #1:  a_t: N, A, A, A        s_t: NL, NL, L, L        r_t: -20, 0, 20, 20
Sample #2:  a_t: N, A, A, A, A    s_t: NL, NL, L, L, L    r_t: -20, 0, 20, 5, 20
Example (γ = 1):
–Sample #1: from the first NL, U(NL) = -20 + 0 + 20 + 20 = 20; from the second NL, U(NL) = 0 + 20 + 20 = 40.
–Sample #2: from the first NL, U(NL) = -20 + 0 + 20 + 5 + 20 = 25; from the second NL, U(NL) = 0 + 20 + 5 + 20 = 45.
–On average, U(NL) = (20 + 40 + 25 + 45) / 4 = 32.5.
Drawback: does not use the fact that the utilities of states are dependent (Bellman equations)! (A sketch of this estimator follows below.)
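A sketch of direct utility estimation over recorded trials, reproducing the averaging in the example above. The trial data are the two samples from this slide with γ = 1; variable names are assumptions for illustration.

from collections import defaultdict

gamma = 1.0

# The two sample trials from the slide as (state, reward) pairs.
trials = [
    [("NL", -20), ("NL", 0), ("L", 20), ("L", 20)],            # sample #1
    [("NL", -20), ("NL", 0), ("L", 20), ("L", 5), ("L", 20)],  # sample #2
]

returns = defaultdict(list)   # state -> list of observed returns (rewards-to-go)

for trial in trials:
    G = 0.0
    # Walk backwards so G accumulates the (discounted) sum of rewards from t onward.
    for s, r in reversed(trial):
        G = r + gamma * G
        returns[s].append(G)

# Direct utility estimate: the average of the observed returns for each state.
U = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(U["NL"])   # (20 + 40 + 25 + 45) / 4 = 32.5, as in the slide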

7 Rutgers CS440, Fall 2003 Adaptive dynamic programming (ADP)
Take into account the constraints described by the Bellman equations.
Algorithm: for each sample, at each time step:
1. Estimate P(s_t | s_{t-1}, a_{t-1}) from counts, e.g., P(L | NL, A) = #(L, NL, A) / #(NL, A).
2. Compute U(s_t) from R(s_t) and P(s_t | s_{t-1}, a_{t-1}) using the Bellman equations for the fixed policy, U(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U(s'), solved directly or by iterative updates.
Drawback: usually (too) many states. (A sketch follows below.)
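A sketch of the ADP idea under a fixed policy: estimate the transition probabilities from counts, then sweep the fixed-policy Bellman equation until the values settle. The trial data, state names, and number of sweeps are illustrative assumptions; because the policy is fixed, the action is a function of the state, so transitions are counted per state here rather than per (state, action) pair as written on the slide.

from collections import defaultdict

gamma = 0.9

# Observed trials under the fixed policy, as (state, reward) sequences
# (illustrative data; in practice these come from running the policy).
trials = [
    [("s1", -1.0), ("s2", 2.0), ("s2", 2.0), ("s1", -1.0)],
    [("s1", -1.0), ("s1", -1.0), ("s2", 2.0), ("s2", 2.0)],
]

# 1. Estimate the transition model from counts: P(s' | s) = #(s -> s') / #(s -> anything).
pair_counts = defaultdict(lambda: defaultdict(int))
out_counts = defaultdict(int)
R = {}
for trial in trials:
    for (s, r), (s_next, _) in zip(trial, trial[1:]):
        pair_counts[s][s_next] += 1
        out_counts[s] += 1
    for s, r in trial:
        R[s] = r   # observed reward for each visited state

P = {s: {s2: c / out_counts[s] for s2, c in nxt.items()}
     for s, nxt in pair_counts.items()}

# 2. Solve the fixed-policy Bellman equations by repeated sweeps
#    (iterative updates with the policy baked into P).
U = {s: 0.0 for s in R}
for _ in range(100):
    U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P.get(s, {}).items())
         for s in R}
print(U)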

8 Rutgers CS440, Fall 2003 TD-Learning
Only update U-values for observed transitions.
Algorithm:
1. Receive a new sample pair (s_t, s_{t+1}).
2. Assume only the transition s_t → s_{t+1} can occur.
3. Update U: U(s_t) ← U(s_t) + α [ R(s_t) + γ U(s_{t+1}) - U(s_t) ], i.e., nudge the old value U(s_t) toward the new value R(s_t) + γ U(s_{t+1}) computed from the Bellman equation.
Does not need to compute the model parameters! (Yet it converges to the "right" solution.) (A sketch follows below.)
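A sketch of the tabular TD(0) update above applied to a stream of observed transitions. The learning rate, discount, transition data, and state names are illustrative assumptions (a fixed α is used here; in practice α is usually decayed over time).

from collections import defaultdict

gamma = 0.9
alpha = 0.1                     # learning rate (fixed here for simplicity)

U = defaultdict(float)          # utility estimates, default 0 for unseen states

def td_update(s, r, s_next):
    """TD(0): move U(s) a little toward the one-step target r + gamma * U(s_next)."""
    target = r + gamma * U[s_next]     # value suggested by the Bellman equation
    U[s] += alpha * (target - U[s])    # old value nudged toward the target

# Illustrative stream of observed (s_t, R(s_t), s_{t+1}) transitions.
transitions = [("s1", -1.0, "s2"), ("s2", 2.0, "s2"),
               ("s2", 2.0, "s1"), ("s1", -1.0, "s2")] * 50

for s, r, s_next in transitions:
    td_update(s, r, s_next)

print(dict(U))

Note that no transition model is ever built, which is the point of the slide.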

