Reinforcement Learning
CSE 473 (© D. Weld and D. Fox)

Heli Flying

Review: MDPs
S = set of states (|S| = n)
A = set of actions (|A| = m)
Pr = transition function Pr(s, a, s'), represented by a set of m n×n stochastic matrices, each defining a distribution over S×S
R(s) = bounded, real-valued reward function, represented by an n-vector
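As a concrete illustration, here is a minimal Python sketch of this representation; the helper names and the toy numbers are assumptions for illustration, not from the slides:

```python
import numpy as np

n_states, n_actions = 4, 2
# One n x n stochastic matrix per action: T[a][s, s'] = Pr(s' | s, a).
T = np.zeros((n_actions, n_states, n_states))
T[0] = np.eye(n_states)                      # action 0: stay put
T[1] = np.roll(np.eye(n_states), 1, axis=1)  # action 1: move to the next state
# Bounded, real-valued reward function as an n-vector: R[s].
R = np.array([0.0, 0.0, 0.0, 1.0])
# Every row of every matrix must be a distribution over successor states.
assert np.allclose(T.sum(axis=2), 1.0)
```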

Goal for an MDP
Find a policy that maximizes expected discounted reward over an infinite horizon, for a fully observable Markov decision process.
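Stated formally (a standard formulation consistent with the slide, not copied from it):

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi \,\right],
\qquad 0 \le \gamma < 1 .
```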

Max Bellman Backup
[Figure: backup diagram. From state s, actions a1, a2, a3 lead to successor states with values V_n; these give Q_{n+1}(s, a), and V_{n+1}(s) is their max.]
Improve the estimate of the value function:
V_{t+1}(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V_t(s') }
The sum is the expected future reward, averaged over destination states.

Value Iteration
Assign arbitrary values to each state (or use an admissible heuristic).
Iterate over all states, improving the value function via Bellman backups.
Stop when the iteration converges (V_t approaches V* as t → ∞).
This is dynamic programming.
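A compact value-iteration sketch over the arrays sketched above; it omits the action-cost term c(a) and puts rewards on states only. An illustrative sketch, not the course's reference code:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, eps=1e-6):
    """Repeat Bellman backups: V(s) <- R(s) + gamma * max_a sum_s' Pr(s'|a,s) V(s')."""
    V = np.zeros(len(R))            # arbitrary initial values
    while True:
        Q = R + gamma * (T @ V)     # Q[a, s] = R(s) + gamma * E[V(s') | s, a]
        V_new = Q.max(axis=0)       # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) < eps:
            return V_new            # V_t approaches V* as t grows
        V = V_new
```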

How is learning to act possible when…
Actions have non-deterministic effects, which are initially unknown
Rewards / punishments are infrequent, often at the end of long sequences of actions
The learner must decide what actions to take
The world is large and complex

Naïve Approach
1. Act randomly for a while (or systematically explore all possible actions)
2. Learn the transition function and the reward function
3. Use value iteration, policy iteration, … (see the count-based sketch below)
Problems?
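A sketch of steps 2 and 3: estimate the model by counting observed transitions, then solve the estimated MDP. The (s, a, s', r) tuple encoding and helper names are assumptions for illustration:

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate T and R by counting observed (s, a, s', r) tuples."""
    counts = np.zeros((n_actions, n_states, n_states))
    R_hat = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, a, s2, r in transitions:
        counts[a, s, s2] += 1
        R_hat[s] += r
        visits[s] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Normalize counts into probabilities; unvisited (s, a) pairs stay uniform.
    T_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    R_hat = R_hat / np.maximum(visits, 1)   # average observed reward per state
    return T_hat, R_hat                      # then solve with value_iteration(T_hat, R_hat)
```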

Example: Suppose we are given a policy and want to determine how good it is.

Objective: Value Function

Passive RL
Given policy π, estimate U^π(s).
Not given the transition matrix, nor the reward function!
Epochs: training sequences
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (1,2) → (2,2) → (3,2) −1
(1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (2,3) → (3,3) +1
(1,1) → (1,2) → (1,1) → (1,2) → (1,1) → (2,1) → (2,2) → (2,3) → (3,3) +1
(1,1) → (1,2) → (2,2) → (1,2) → (1,3) → (2,3) → (1,3) → (2,3) → (3,3) +1
(1,1) → (2,1) → (2,2) → (2,1) → (1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (3,2) −1
(1,1) → (2,1) → (1,1) → (1,2) → (2,2) → (3,2) −1

Approach 1: Direct Estimation
Estimate U^π(s) as the average total reward of epochs containing s (calculated from s to the end of each epoch).
Pros / cons? Requires a huge amount of data, and doesn't exploit the Bellman constraints: the expected utility of a state = its own reward + the expected utility of its successors.
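A sketch of direct estimation, assuming each epoch is encoded as a list of (state, reward) pairs; the encoding is an assumption, since the slides don't specify one:

```python
from collections import defaultdict

def direct_estimation(epochs, gamma=1.0):
    """U(s) = average (discounted) reward-to-go over all visits to s."""
    totals, visits = defaultdict(float), defaultdict(int)
    for epoch in epochs:
        reward_to_go = 0.0
        for state, reward in reversed(epoch):   # accumulate from the end of the epoch
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}
```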

Temporal Difference Learning
Do backups on a per-action basis; don't try to estimate the entire transition function!
For each transition from s to s', update:
U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))
where α is the learning rate and γ the discount rate.
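The same update as one line of Python (a sketch; U is any dict-like table of utilities, e.g. a defaultdict(float)):

```python
def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
    """TD(0): nudge U(s) toward the sampled target r + gamma * U(s')."""
    U[s] += alpha * (r + gamma * U[s2] - U[s])
```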

Notes
Once U^π is learned, the updates average to 0: R(s) + γ U^π(s') − U^π(s) is zero in expectation.
Adjusts a state to 'agree' with its observed successor, not with all possible successors.
Doesn't require M, a model of the transition function.

Notes II
"TD(0)": one-step lookahead.
Can do 2-step, 3-step, …

TD(λ)
Or… take it to the limit!
Compute a weighted average over all future states; the update becomes a weighted average.
Implementation: propagate the current weighted TD error onto past states (see the eligibility-trace sketch below); must memorize the states visited since the start of the epoch.
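A backward-view sketch with eligibility traces: the current TD error is applied to every state visited so far in the epoch, with weight decaying by γλ, so λ = 0 recovers TD(0). The epoch encoding and names are assumptions, and U should be a defaultdict(float):

```python
from collections import defaultdict

def td_lambda_epoch(epoch, U, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one epoch of (state, reward) pairs."""
    trace = defaultdict(float)               # eligibility of visited states
    for (s, r), (s2, _) in zip(epoch, epoch[1:]):
        delta = r + gamma * U[s2] - U[s]     # current TD error
        trace[s] += 1.0                      # s was just visited
        for x in trace:                      # propagate the error onto past states
            U[x] += alpha * delta * trace[x]
            trace[x] *= gamma * lam          # decay eligibility
    return U
```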

Q-Learning
A version of TD learning where, instead of learning a value function on states, we learn a function on [state, action] pairs.
[Helpful for model-free policy learning]
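The standard Q-learning update as a sketch (Q is a table over (state, action) pairs; because the max is taken over the learner's own table, no transition model is needed):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)], zero-initialized

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward the sampled target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```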

Part II
So far, we've assumed the agent had a policy; now suppose the agent must learn one while acting in an uncertain world.

Active Reinforcement Learning
Suppose the agent must build its policy while learning.
First approach: start with an arbitrary policy; apply Q-learning; the new policy is: in state s, choose the action a that maximizes Q(s, a).
Problem?

Utility of Exploration
Too easily stuck in a non-optimal part of the space: the "exploration versus exploitation tradeoff".
Solution 1: with fixed probability, perform a random action (sketched below).
Solution 2: increase the estimated expected value of infrequently visited states.
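Solution 1 as a sketch, ε-greedy selection over the learned Q values (the helper names are assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps take a random action; otherwise exploit Q."""
    if random.random() < eps:
        return random.choice(actions)             # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit
```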

~World's Best Player
A neural network with 80 hidden units, using computed features, trained on 300,000 games against itself (Tesauro's TD-Gammon backgammon player).

Imitation Learning
What can you do if you have a teacher? People are often…
…good at demonstrating a system
…bad at specifying exact rewards / utilities
Idea: learn the reward function that best "explains" the demonstrated behavior; that is, learn a reward such that the demonstrated behavior is optimal with respect to it.
Also called apprenticeship learning, or inverse RL.

Data Collection
Features: length, speed, road type, lanes, accidents, construction, congestion, time of day.
25 taxi drivers, over 100,000 miles.
Courtesy of B. Ziebart

Destination Prediction Courtesy of B. Ziebart

Heli Airshow Courtesy of P. Abbeel

Summary
Use reinforcement learning when the model of the world is unknown and/or rewards are delayed.
Temporal difference learning: a simple and efficient training rule.
Q-learning eliminates the need for an explicit transition model.
Large state spaces can (sometimes!) be handled by function approximation, using linear functions or neural nets.