
1 To do immediately!
- Split into groups: 2 or 3 students per group.
- Hand in the groups: today, at the end of the class!
- The Reinforcement Learning book is available online (access from the workshop website).

2 Reinforcement Learning: Learning Algorithms, Function Approximation
Yishay Mansour, Tel-Aviv University

3 Outline
Week I: Basics
- mathematical model (MDP)
- planning: value iteration, policy iteration
Week II: Learning Algorithms
- model based
- model free
Week III: Large state space

4 Learning Algorithms
Given access only to actions (no model of the environment), perform:
1. policy evaluation
2. control: find an optimal policy
Two approaches:
1. model based (Dynamic Programming)
2. model free (Q-Learning, SARSA)

5 Learning: Policy improvement
Assume that we can compute, for a given policy π, the V and Q functions of π.
Then we can perform policy improvement:
- π' = Greedy(Q)
The process converges if the estimates are accurate.

6 Learning - Model Free, Optimal Control: off-policy
Learn the Q function online:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t Δ_t
Δ_t = r_t + γ max_a { Q_t(s_{t+1}, a) } - Q_t(s_t, a_t)
OFF-POLICY: Q-Learning. Note the maximization operator!
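
A minimal tabular sketch of this Q-learning update (the function name, the dictionary-backed Q table, and the default step sizes are illustrative assumptions, not part of the course material):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning step: the target maximizes over the
    actions available in the next state, regardless of which action
    the behavior policy will actually take."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta

# Q defaults to 0 for unseen (state, action) pairs.
Q = defaultdict(float)
```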

7 Learning - Model Free, Policy evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our "estimation error" is Δ_t = r_t + γ V_t(s_{t+1}) - V_t(s_t).
The update: V_{t+1}(s_t) = V_t(s_t) + α Δ_t
No maximization over actions!
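
A matching TD(0) sketch for policy evaluation, again with assumed names and a dictionary-backed value table:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: no maximization over actions; s_next is whatever
    state the fixed policy led to."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = defaultdict(float)  # state values default to 0
```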

8 Learning - Model Free, Optimal Control: on-policy
Learn the optimal Q* function online:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [ r_t + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ]
ON-POLICY: SARSA. Here a_{t+1} is chosen by the ε-greedy policy for Q_t; the policy itself selects the action.
Need to balance exploration and exploitation.
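
A SARSA sketch with ε-greedy action selection; the helper names and parameter defaults are assumptions:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps pick a random action (exploration),
    otherwise the greedy action for Q (exploitation)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: the target uses the action a_next that the
    eps-greedy policy actually chose, not a max over actions."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta
```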

9 Modified Notation
Rather than Q(s, a), write Q_a(s): each action a has its own function Q_a(s).
Greedy(Q) at state s selects argmax_a Q_a(s).
Learn each Q_a(s) independently!

10 Large state space
Reduce the number of states:
- symmetries (e.g., in the X-O game)
- cluster states
Define attributes:
- a limited number of attributes
- some states will become identical
- an "action view" of a state

11 Example: X-O (tic-tac-toe)
For each action (square):
- Consider the row, column, and diagonal through it.
- The state encodes the status of these "lines":
  two X's, two O's, mixed (both X and O), one X, one O, empty.
- Only three types of squares/actions remain.
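
One possible way to compute these per-square line attributes, assuming a 9-cell board list holding 'X', 'O', or ' ' (the encoding and helper names are hypothetical):

```python
# Indices of the rows, columns, and diagonals of a 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def line_status(board, line):
    """Classify one line as 'mixed', 'empty', or 'k X' / 'k O'."""
    xs = sum(board[i] == 'X' for i in line)
    os = sum(board[i] == 'O' for i in line)
    if xs and os:
        return 'mixed'
    if xs:
        return f'{xs} X'
    if os:
        return f'{os} O'
    return 'empty'

def square_attributes(board, square):
    """Attributes of an action = statuses of all lines through its square."""
    return sorted(line_status(board, l) for l in LINES if square in l)
```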

12 Clustering states
Need to create attributes, and the attributes should be "game dependent".
Different "real" states get the same representation.
How do we differentiate states?
- We estimate action values.
- Consider only legal actions.
- Play the "best" action.

13 Function Approximation
Use a limited model for Q_a(s).
Have an attribute vector: each state s has a vector vec(s) = (x_1, ..., x_k), normally with k << |S|.
Examples:
- neural network
- decision tree
- linear function: weights θ = (θ_1, ..., θ_k), value Σ_i θ_i x_i

14 Gradient Descent
Minimize the squared error:
- squared error = ½ Σ_s P(s) [V^π(s) - V_θ(s)]^2
- P(s) is some weighting over the states
Algorithm:
- θ(t+1) = θ(t) + α [V^π(s_t) - V_{θ(t)}(s_t)] ∇_{θ(t)} V_{θ(t)}(s_t)
- ∇_{θ(t)} is the vector of partial derivatives with respect to θ(t)
- Replace V^π(s_t) by a sample:
  Monte Carlo: use the return R_t for V^π(s_t).
  TD(0): use Δ_t for [V^π(s_t) - V_{θ(t)}(s_t)].

15 Linear Functions
Linear function: V_θ(s) = Σ_i θ_i x_i = <θ, vec(s)>
Derivative: ∇_{θ(t)} V_t(s_t) = vec(s_t)
Update rule:
- θ_{t+1} = θ_t + α [V^π(s_t) - V_t(s_t)] vec(s_t)
- MC: θ_{t+1} = θ_t + α [R_t - <θ_t, vec(s_t)>] vec(s_t)
- TD(0): θ_{t+1} = θ_t + α Δ_t vec(s_t)
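
A sketch of the TD(0) update with a linear approximator, using NumPy arrays for θ and vec(s); the names and step sizes are assumptions:

```python
import numpy as np

def linear_td0_update(theta, x_s, r, x_next, alpha=0.01, gamma=0.99):
    """TD(0) with V_theta(s) = <theta, vec(s)>. The gradient of a linear
    function is just the feature vector, so theta moves by alpha * delta * vec(s_t)."""
    delta = r + gamma * (theta @ x_next) - (theta @ x_s)
    theta += alpha * delta * x_s   # in-place update of the weight vector
    return delta
```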

16 Example: 4 in a row
Select attributes for an action (column):
- 3 in a row (of type X or type O)
- 2 in a row (type X or O), blocked or not
- next location makes 3 in a row
- next move might lose
- other "features"
RL will learn the weights.
Look-ahead helps significantly: use a max-min tree.

17 Bootstrapping
Playing against a "good" player:
- using ...
Self play:
- start with a random player
- play against oneself
Choose a starting point:
- a max-min tree with a simple scoring function
Add some simple guidance:
- add "compulsory" moves
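
A rough self-play episode skeleton; the game-interface callbacks (initial_state, legal_actions, play, winner) are hypothetical placeholders for whatever game the team implements, and Q is assumed to be a dict-like table with default 0:

```python
import random

def self_play_episode(Q, initial_state, legal_actions, play, winner, eps=0.1):
    """Play one game of eps-greedy self-play and return the visited
    (state, action) pairs together with the final outcome, so that the
    Q-learning / SARSA updates from the earlier slides can be applied."""
    s = initial_state()
    history = []
    while winner(s) is None:          # None means the game is not over yet
        actions = legal_actions(s)
        if random.random() < eps:     # explore
            a = random.choice(actions)
        else:                         # exploit the current value estimates
            a = max(actions, key=lambda act: Q[(s, act)])
        history.append((s, a))
        s = play(s, a)
    return history, winner(s)
```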

18 Scoring Function
Checkers:
- number of pieces
- number of queens (kings)
Chess:
- weighted sum of pieces
Othello/Reversi:
- difference in the number of pieces
Can be used with a max-min tree and (α, β) pruning.
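
A generic max-min search with (α, β) pruning; the score, children, and terminal callbacks are hypothetical hooks for the game-specific scoring functions listed above:

```python
def alphabeta(state, depth, alpha, beta, maximizing, score, children, terminal):
    """Depth-limited max-min search; alpha/beta bound the values the
    maximizer/minimizer can already guarantee, so hopeless branches are cut."""
    if depth == 0 or terminal(state):
        return score(state)
    if maximizing:
        value = float('-inf')
        for child in children(state):
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, score, children, terminal))
            alpha = max(alpha, value)
            if alpha >= beta:   # beta cutoff: the minimizer avoids this branch
                break
        return value
    value = float('inf')
    for child in children(state):
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, score, children, terminal))
        beta = min(beta, value)
        if beta <= alpha:       # alpha cutoff
            break
    return value
```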

19 Example: Reversi (Othello)
Use a simple scoring function based on:
- difference in pieces
- edge pieces
- corner pieces
Use a max-min tree.
RL: optimize the weights.

20 Advanced issues
Time constraints:
- fast and slow modes
Opening:
- can help
End game:
- many cases have only a few pieces and can be solved efficiently
Training on a specific state:
- might be helpful; not sure it is worth the effort

21 What is Next?
Create teams:
- at least 2 students, at most 3 students
- group size will influence our expectations!
Choose a game!
Give the names and the game; GUI for the game.
Deadline: Dec. 17, 2006.

22 Schedule (more)
System specification:
- project outline
- high-level component planning
- Jan. 21, 2007
Build the system.
Project completion:
- April 29, 2007
All supporting documents in HTML!

23 Next week
GUI interface (using C++).
Afterwards:
- each group works by itself

