Lirong Xia Reinforcement Learning (1) Tue, March 18, 2014

Reminder
Midterm:
–mean 82/99
–Problem 5(b)2 now has 8 points
–your grade is on LMS
Project 2 due / Project 3 out this Friday

Last time
Markov decision processes
Computing the optimal policy:
–value iteration
–policy iteration

Grid World
The agent lives in a grid
Walls block the agent's path
The agent's actions do not always go as planned:
–80% of the time, the action North takes the agent North (if there is no wall there)
–10% of the time, North takes the agent West; 10% East
–If there is a wall in the direction the agent would have gone, the agent stays put for this turn
Small "living" reward each step
Big rewards come at the end
Goal: maximize the sum of rewards

Markov Decision Processes
An MDP is defined by:
–A set of states s ∈ S
–A set of actions a ∈ A
–A transition function T(s, a, s')
  Prob that a from s leads to s', i.e., p(s' | s, a)
  sometimes called the model
–A reward function R(s, a, s')
  Sometimes just R(s) or R(s')
–A start state (or distribution)
–Maybe a terminal state
MDPs are a family of nondeterministic search problems
–Reinforcement learning (next class): MDPs where we don't know the transition or reward functions

Recap: Defining MDPs
Markov decision processes:
–States S
–Start state s_0
–Actions A
–Transitions p(s'|s,a) (or T(s,a,s'))
–Rewards R(s,a,s') (and discount γ)
MDP quantities so far:
–Policy = choice of action for each (MAX) state
–Utility (or return) = sum of discounted rewards

Optimal Utilities
Fundamental operation: compute the values (optimal expectimax utilities) of states s
Define the value of a state s:
–V*(s) = expected utility starting in s and acting optimally
Define the value of a Q-state (s,a):
–Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
Define the optimal policy:
–π*(s) = optimal action from state s

The Bellman Equations
One-step lookahead relationship amongst optimal utility values:
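The equations themselves appeared only as an image on the slide; written out in the notation defined above (transitions T, rewards R, discount γ), the standard Bellman optimality equations are:

V^*(s) = \max_a Q^*(s,a)
Q^*(s,a) = \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V^*(s') \right]

so that V^*(s) = \max_a \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V^*(s') \right].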

Solving MDPs
We want to find the optimal policy
Proposal 1: modified expectimax search, starting from each state s:

Value Iteration
Idea:
–Start with V_1(s) = 0
–Given V_i, calculate the values for all states for depth i+1
–Repeat until convergence
–Use V_i as the evaluation function when computing V_{i+1}
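The update formula was an image on the slide; the standard form, and a minimal Python sketch of it, follow. The containers T[s][a] (a list of (next_state, probability) pairs), R[(s,a,s')], and the callable actions(s) are assumptions chosen for this sketch, not part of the slides.

V_{i+1}(s) = \max_a \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V_i(s') \right]

def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Illustrative value iteration on a small discrete MDP.

    T[s][a] is a list of (next_state, probability) pairs and R[(s, a, s2)]
    is the reward for that transition; both containers and the callable
    actions(s) are assumptions made for this sketch, not from the slides.
    """
    V = {s: 0.0 for s in states}  # start with V_1(s) = 0
    for _ in range(iterations):
        V_next = {}
        for s in states:
            acts = actions(s)
            if not acts:              # terminal state: nothing to back up
                V_next[s] = 0.0
                continue
            # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma V_i(s') ]
            V_next[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[s][a])
                for a in acts
            )
        V = V_next
    return V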

Example: Value Iteration
Information propagates outward from terminal states, and eventually all states have correct value estimates

Policy Iteration
Alternative approach:
–Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!)
–Step 2: policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities as future values
–Repeat steps until the policy converges
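Written out (the formulas were images on the slide), the two steps are the standard updates:

\text{Policy evaluation:}\quad V^{\pi}_{i+1}(s) = \sum_{s'} T(s,\pi(s),s')\left[ R(s,\pi(s),s') + \gamma V^{\pi}_i(s') \right]
\text{Policy improvement:}\quad \pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V^{\pi}(s') \right]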

Today: Reinforcement learning
Still have an MDP:
–A set of states S
–A set of actions (per state) A
–A model T(s,a,s')
–A reward function R(s,a,s')
Still looking for an optimal policy π*(s)
New twist: we don't know T and/or R, but we can observe R
–Learn by doing
–can have multiple episodes (trials)

Example: animal learning
Studied experimentally for more than 60 years in psychology
–Rewards: food, pain, hunger, drugs, etc.
Example: foraging
–Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies

What can you do with RL
Stanford autonomous helicopter

Reinforcement learning methods
Model-based learning
–learn the model of the MDP (transition probabilities and rewards)
–compute the optimal policy as if the learned model were correct
Model-free learning
–learn the optimal policy without explicitly learning the transition probabilities
–Q-learning: learn the Q-values Q(s,a) directly

Model-Based Learning
Idea:
–Learn the model empirically from episodes
–Solve for values as if the learned model were correct
Simple empirical model learning:
–Count outcomes for each (s,a)
–Normalize to give an estimate of T(s,a,s')
–Discover R(s,a,s') when we experience (s,a,s')
Solving the MDP with the learned model:
–Iterative policy evaluation, for example
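A minimal sketch of the counting step in Python, assuming each episode is a list of (s, a, s', r) transitions; this representation is an assumption chosen for illustration, not part of the slides.

from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') and R(s,a,s') from observed transitions.

    episodes: iterable of episodes, each a list of (s, a, s2, r) tuples
    (hypothetical format chosen for this sketch).
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
    rewards = {}                                      # last observed R(s, a, s2)
    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r
    # Normalize counts into transition probability estimates T(s,a,s')
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T[(s, a)] = {s2: n / total for s2, n in outcomes.items()}
    return T, rewards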

Example: Model-Based Learning
Episodes (state, action, reward):
Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
γ = 1
T((3,3), right, (4,3)) = 1/3
T((2,3), right, (3,3)) = 2/2

Model-based vs. Model-Free
Want to compute an expectation weighted by probability p(x) (e.g. expected utility)
Model-based: estimate p(x) from samples, then compute the expectation
Model-free: estimate the expectation directly from samples
Why does this work? Because samples appear with the right frequencies!
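In symbols (the slide's formulas were an image), given samples x_1, …, x_N drawn from p:

\text{Model-based:}\quad \hat{p}(x) = \frac{\text{count}(x)}{N}, \qquad E[f(x)] \approx \sum_x \hat{p}(x)\, f(x)
\text{Model-free:}\quad E[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)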

Example
Flip a biased coin
–if heads, you get $10
–if tails, you get $1
–you don't know the probability of heads/tails
–What is your expected gain?
8 episodes:
–h, t, t, t, h, t, t, h
Model-based: p(h) = 3/8, so E(gain) = 10×3/8 + 1×5/8 = 35/8
Model-free: (10+1+1+1+10+1+1+10)/8 = 35/8

Sample-Based Policy Evaluation?
Approximate the expectation with samples (drawn from an unknown T!)
Almost! But we cannot rewind time to get samples from s in the same episode
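The update being alluded to (shown only as an image on the slide) averages sampled one-step returns for the fixed policy π:

\text{sample}_i = R(s,\pi(s),s'_i) + \gamma V^{\pi}(s'_i), \qquad V^{\pi}(s) \approx \frac{1}{N}\sum_{i} \text{sample}_i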

Temporal-Difference Learning
Big idea: learn from every experience!
–Update V(s) each time we experience a transition (s, a, s', r)
–Likely s' will contribute updates more often
Temporal difference learning:
–Policy still fixed
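The update rule (an image on the slide) is the standard TD(0) update with learning rate α:

\text{sample} = R(s,\pi(s),s') + \gamma V^{\pi}(s'), \qquad V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha\, \text{sample}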

Exponential Moving Average
Exponential moving average:
–Makes recent samples more important
–Forgets about the past (distant past values were wrong anyway)
–Easy to compute from the running average
Decreasing the learning rate can give converging averages
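Concretely (the formula was an image on the slide), with learning rate α the running estimate is:

\bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\, x_n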

Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation
However, if we want to turn values into a (new) policy, we're sunk:
Idea: learn Q-values directly
–Makes action selection model-free too!
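The missing formula makes the problem concrete: extracting a greedy policy from V requires the model, while extracting it from Q does not:

\pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V(s') \right] \qquad \text{vs.} \qquad \pi(s) = \arg\max_a Q(s,a)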

Active Learning
Full reinforcement learning:
–You don't know the transitions T(s,a,s')
–You don't know the rewards R(s,a,s')
–You can choose any actions you like
–Goal: learn the optimal policy
In this case:
–Learner makes choices!
–Fundamental tradeoff: exploration vs. exploitation
  exploration: try new actions
  exploitation: focus on optimal actions based on the current estimates
–This is NOT offline planning! You actually take actions in the world and see what happens…

Detour: Q-Value Iteration
Value iteration: find successive approximations to the optimal values
–Start with V_0^*(s) = 0
–Given V_i^*, calculate the values for all states for depth i+1
But Q-values are more useful
–Start with Q_0^*(s,a) = 0
–Given Q_i^*, calculate the Q-values for all Q-states for depth i+1
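The two updates (images on the slide) are:

V^*_{i+1}(s) = \max_a \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma V^*_i(s') \right]
Q^*_{i+1}(s,a) = \sum_{s'} T(s,a,s')\left[ R(s,a,s') + \gamma \max_{a'} Q^*_i(s',a') \right]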

Q-Learning
Q-learning: sample-based Q-value iteration
Learn Q*(s,a) values:
–Receive a sample (s, a, s', r)
–Consider your old estimate Q(s,a)
–Consider your new sample estimate
–Incorporate the new estimate into a running average
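The sample estimate and running average (images on the slide) are sample = r + γ max_{a'} Q(s',a') and Q(s,a) ← (1-α) Q(s,a) + α·sample. A minimal Python sketch of one such update follows; the dictionary Q keyed by (state, action) and the helper legal_actions are assumptions for illustration, not part of the slides.

def q_learning_update(Q, s, a, s2, r, legal_actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed transition (s, a, s2, r).

    Q maps (state, action) -> estimated Q-value; legal_actions(s2) returns the
    actions available in s2 (both hypothetical helpers for this sketch).
    """
    acts = legal_actions(s2)
    # max_{a'} Q(s', a'); 0 if s2 is terminal (no actions available)
    best_next = max((Q.get((s2, a2), 0.0) for a2 in acts), default=0.0)
    sample = r + gamma * best_next
    old = Q.get((s, a), 0.0)
    # Running average: Q(s,a) <- (1 - alpha) Q(s,a) + alpha * sample
    Q[(s, a)] = (1 - alpha) * old + alpha * sample
    return Q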

Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy
–If you explore enough
–If you make the learning rate small enough
–…but not decrease it too quickly!
–Basically doesn't matter how you select actions (!)
Neat property: off-policy learning
–Learn the optimal policy without following it (some caveats)

Q-Learning
Q-learning produces tables of Q-values:

Exploration / Exploitation
Several schemes for forcing exploration
–Simplest: random actions (ε-greedy)
  Every time step, flip a coin
  With probability ε, act randomly
  With probability 1-ε, act according to the current policy
–Problems with random actions?
  You do explore the space, but keep thrashing around once learning is done
  One solution: lower ε over time
  Another solution: exploration functions
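A minimal sketch of ε-greedy action selection in Python, using the same assumed Q dictionary and legal_actions helper as the earlier sketch (illustrative names only):

import random

def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q maps (state, action) -> estimated Q-value; legal_actions(s) returns the
    actions available in s (hypothetical helpers for this sketch).
    """
    acts = legal_actions(s)
    if random.random() < epsilon:
        return random.choice(acts)                      # explore
    return max(acts, key=lambda a: Q.get((s, a), 0.0))  # exploit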

Exploration Functions
When to explore:
–Random actions: explore a fixed amount
–Better idea: explore areas whose badness is not (yet) established
Exploration function:
–Takes a value estimate and a count, and returns an optimistic utility, e.g. (exact form not important)
sample =
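The slide itself notes the exact form is not important; one common illustrative choice (an assumption here, with k a constant and N(s,a) a visit count) is f(u,n) = u + k/n, plugged into the Q-learning sample in place of the plain Q-value:

f(u,n) = u + \frac{k}{n}, \qquad \text{sample} = R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)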