Presentation transcript:

For immediate action! Form teams of 2 or 3 students and hand in the team list today, at the end of the class! The Reinforcement Learning book is available online (accessible from the workshop website).

Reinforcement Learning: Learning Algorithms and Function Approximation
Yishay Mansour, Tel-Aviv University

Outline
Week I: Basics
- Mathematical model (MDP)
- Planning: value iteration, policy iteration
Week II: Learning algorithms
- Model based
- Model free
Week III: Large state spaces

Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control: find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning, SARSA).

Learning: Policy Improvement
Assume that we can compute, for a given policy π, the V and Q functions of π.
Then we can perform policy improvement:
- π' = Greedy(Q)
The process converges if the estimates are accurate. (A one-line sketch of the greedy step appears below.)
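As an illustration, a minimal greedy policy-improvement step in Python (a sketch only; the Q-table layout, state names, and action names are assumptions, not from the slides):

```python
def greedy_policy(Q, states, actions):
    """Policy improvement step: pi'(s) = argmax_a Q(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Hypothetical usage with a Q-table keyed by (state, action) pairs:
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
pi = greedy_policy(Q, states=["s0"], actions=["left", "right"])  # {'s0': 'right'}
```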

Learning - Model Free. Optimal Control: off-policy
Learn the Q function online:
  Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α·A_t
  A_t = r_t + γ·max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)
Note the maximization operator: this is OFF-policy Q-Learning. (A tabular sketch of the update follows.)
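A minimal tabular Q-Learning update as a sketch (not from the slides; the learning rate alpha, discount gamma, and the state/action encoding are assumptions):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-Learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # maximization operator (off-policy)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q

# Hypothetical usage: states and actions are hashable identifiers.
Q = defaultdict(float)
actions = ["left", "right"]
q_learning_update(Q, s="s0", a="right", r=1.0, s_next="s1", actions=actions)
```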

Learning - Model Free. Policy evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our "estimation error" is
  A_t = r_t + γ·V_t(s_{t+1}) - V_t(s_t)
and the update is
  V_{t+1}(s_t) = V_t(s_t) + α·A_t
There is no maximization over actions! (A sketch follows.)
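A minimal sketch of this tabular TD(0) update (illustrative only; the values of alpha and gamma are assumptions):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) step for policy evaluation: no maximization over actions."""
    td_error = r + gamma * V[s_next] - V[s]   # A_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    V[s] += alpha * td_error                  # V(s_t) <- V(s_t) + alpha*A_t
    return V

V = defaultdict(float)
td0_update(V, s="s0", r=0.5, s_next="s1")
```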

Learning - Model Free. Optimal Control: on-policy
Learn the optimal Q* function online:
  Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α·[r_t + γ·Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]
This is ON-policy SARSA: a_{t+1} is chosen by the ε-greedy policy for Q_t, so the policy itself selects the action used in the update. We need to balance exploration and exploitation. (A sketch follows.)
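A minimal SARSA step with an ε-greedy action choice, as a sketch (parameter values and the state/action encoding are assumptions):

```python
from collections import defaultdict
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore; otherwise pick the greedy action for state s."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy update: uses the action a_next actually selected by the policy."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q

Q = defaultdict(float)
actions = ["left", "right"]
a_next = epsilon_greedy(Q, "s1", actions)
sarsa_update(Q, "s0", "right", 1.0, "s1", a_next)
```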

Modified Notation
Rather than Q(s, a), keep a separate function Q_a(s) for each action a:
- Greedy(Q) = max_a Q_a(s)
- Learn each Q_a(s) independently!
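One way to realize this per-action view in code, only as a sketch (the dictionary-of-tables layout and action names are assumptions):

```python
from collections import defaultdict

# One value table per action: Q[a] maps a state to its estimated value Q_a(s).
Q = {a: defaultdict(float) for a in ["left", "right", "up", "down"]}

def greedy_action(Q, s):
    """Greedy(Q) = argmax_a Q_a(s); each Q_a can be learned independently."""
    return max(Q, key=lambda a: Q[a][s])
```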

Large state space
Reduce the number of states:
- Symmetries (e.g., in X-O)
- Cluster states: define attributes; with a limited number of attributes, some states become identical
- Take an action view of a state

Example: X-O
For each action (square), consider the row/diagonal/column through it. The state encodes the status of these "lines":
- Two X's
- Two O's
- Mixed (both X and O)
- One X
- One O
- Empty
With this encoding there are only three types of squares/actions. (A sketch of the encoding follows.)
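A possible sketch of this line-status encoding for one candidate square (the board representation and helper names are hypothetical, not from the slides):

```python
def line_status(line):
    """Classify one row/column/diagonal by its counts of X and O."""
    x, o = line.count("X"), line.count("O")
    if x == 2 and o == 0: return "two_X"
    if o == 2 and x == 0: return "two_O"
    if x >= 1 and o >= 1: return "mixed"
    if x == 1:            return "one_X"
    if o == 1:            return "one_O"
    return "empty"

def square_features(board, row, col):
    """Status of every line through the candidate square (row, col).
    board is a 3x3 list of 'X', 'O' or ' '."""
    lines = [board[row], [board[r][col] for r in range(3)]]
    if row == col:
        lines.append([board[i][i] for i in range(3)])
    if row + col == 2:
        lines.append([board[i][2 - i] for i in range(3)])
    return sorted(line_status(line) for line in lines)

board = [["X", " ", "O"],
         [" ", "X", " "],
         [" ", " ", "O"]]
print(square_features(board, 1, 1))  # features of the center square
```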

Clustering states
We need to create attributes, and the attributes should be "game dependent". Different "real" states get the same representation.
How do we differentiate states?
- We estimate action values.
- Consider only legal actions.
- Play the "best" action.

Function Approximation
Use a limited model for Q_a(s), based on an attribute vector:
- Each state s has a vector vec(s) = (x_1, ..., x_k)
- Normally k << |S|
Examples:
- Neural network
- Decision tree
- Linear function: weights θ = (θ_1, ..., θ_k), value Σ_i θ_i·x_i

Gradient Descent
Minimize the squared error:
  Squared Error = ½·Σ_s P(s)·[V^π(s) - V_θ(s)]²
where P(s) is some weighting over the states.
Algorithm:
  θ(t+1) = θ(t) + α·[V^π(s_t) - V_{θ(t)}(s_t)]·∇_{θ(t)} V_{θ(t)}(s_t)
where ∇_{θ(t)} denotes the vector of partial derivatives. Since V^π(s_t) is unknown, replace it by a sample:
- Monte Carlo: use the return R_t for V^π(s_t)
- TD(0): use A_t for [V^π(s_t) - V_{θ(t)}(s_t)]
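A minimal sketch of the Monte-Carlo variant of this gradient step with a linear value function (numpy-based; the feature vector and step size are assumptions):

```python
import numpy as np

def mc_gradient_step(theta, features, R_t, alpha=0.01):
    """theta <- theta + alpha * [R_t - V_theta(s_t)] * grad V_theta(s_t).
    For a linear V_theta(s) = theta . vec(s), the gradient is just vec(s)."""
    x = np.asarray(features, dtype=float)     # vec(s_t)
    v_est = theta @ x                         # V_theta(s_t)
    return theta + alpha * (R_t - v_est) * x  # sample return R_t replaces V^pi(s_t)

theta = np.zeros(4)
theta = mc_gradient_step(theta, features=[1.0, 0.0, 2.0, 1.0], R_t=3.0)
```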

Linear Functions
Linear function: V_θ(s) = Σ_i θ_i·x_i = <θ, vec(s)>
Derivative: ∇_{θ(t)} V_t(s_t) = vec(s_t)
Update rule:
  θ_{t+1} = θ_t + α·[V^π(s_t) - V_t(s_t)]·vec(s_t)
- MC: θ_{t+1} = θ_t + α·[R_t - <θ_t, vec(s_t)>]·vec(s_t)
- TD(0): θ_{t+1} = θ_t + α·A_t·vec(s_t)
(A sketch of the TD(0) variant follows.)
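And the TD(0) variant with linear function approximation, as a sketch (feature vectors and parameter values are illustrative assumptions):

```python
import numpy as np

def linear_td0_step(theta, x_s, x_s_next, r, alpha=0.01, gamma=0.95):
    """theta <- theta + alpha * A_t * vec(s_t),
    with A_t = r_t + gamma * V_theta(s_{t+1}) - V_theta(s_t)."""
    x_s, x_s_next = np.asarray(x_s, float), np.asarray(x_s_next, float)
    A_t = r + gamma * (theta @ x_s_next) - (theta @ x_s)
    return theta + alpha * A_t * x_s

theta = np.zeros(3)
theta = linear_td0_step(theta, x_s=[1.0, 0.0, 1.0], x_s_next=[0.0, 1.0, 1.0], r=1.0)
```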

Example: 4 in a row
Select attributes for an action (column):
- 3 in a row (type X or type O)
- 2 in a row (type X or O), blocked or not
- Next location completes 3 in a row
- Next move might lose
- Other "features"
RL will learn the weights. Look-ahead helps significantly: use a max-min tree.

Bootstrapping
Playing against a "good" player:
- Using ...
Self play:
- Start with a random player and play against oneself. Choose a starting point.
- Max-min tree with a simple scoring function.
Add some simple guidance:
- Add "compulsory" moves.
(A sketch of a self-play loop follows.)
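A schematic self-play training loop, only as a sketch of the idea (the game interface, agent methods, and episode structure are all hypothetical placeholders):

```python
def self_play_training(game, agent, episodes=10000):
    """Bootstrap by letting the agent play both sides and learn from the outcomes."""
    for _ in range(episodes):
        state = game.initial_state()              # or a chosen starting point
        history = []
        while not game.is_terminal(state):
            action = agent.select_action(state)   # e.g., epsilon-greedy over Q_a(s)
            history.append((state, action))
            state = game.apply(state, action)
        reward = game.outcome(state)              # +1 / 0 / -1 from the first player's view
        agent.update_from_episode(history, reward)
    return agent
```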

Scoring Function
Checkers:
- Number of pieces
- Number of queens
Chess:
- Weighted sum of pieces
Othello/Reversi:
- Difference in the number of pieces
Such a function can be used with a max-min tree and (α, β) pruning. (A sketch follows.)
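For illustration, a generic max-min search with (α, β) pruning driven by such a scoring function (a sketch; the game interface functions are hypothetical):

```python
def alphabeta(state, depth, alpha, beta, maximizing, game):
    """Max-min search with (alpha, beta) pruning; leaves are scored by game.score."""
    if depth == 0 or game.is_terminal(state):
        return game.score(state)            # e.g., weighted piece difference
    if maximizing:
        value = float("-inf")
        for child in game.successors(state):
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, game))
            alpha = max(alpha, value)
            if alpha >= beta:               # beta cutoff
                break
        return value
    else:
        value = float("inf")
        for child in game.successors(state):
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True, game))
            beta = min(beta, value)
            if alpha >= beta:               # alpha cutoff
                break
        return value
```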

Example: Reversi (Othello)
Use simple score functions:
- Difference in pieces
- Edge pieces
- Corner pieces
Use a max-min tree; RL optimizes the weights.

Advanced issues
Time constraints:
- Fast and slow modes
Opening:
- Can help
End game:
- Many cases have few pieces and can be solved efficiently
Training on a specific state:
- Might be helpful; not sure it is worth the effort.

What is Next?
Create teams:
- At least 2 students, at most 3 students.
- Group size will influence our expectations!
Choose a game!
- Hand in the member names and the chosen game, and a GUI for the game.
- Deadline: Dec. 17, 2006.

Schedule (continued)
System specification:
- Project outline
- High-level component planning
- Jan. 21, 2007
Build the system. Project completion:
- April 29, 2007
All supporting documents in HTML!

Next week
GUI interface (using C++).
Afterwards, each group works on its own.