Lirong Xia Reinforcement Learning (2) Tue, March 21, 2014

Reminder 1
–Project 2 due tonight
–Project 3 is online (more later); due in two weeks

Recap: MDPs 2
Markov decision processes:
–States S
–Start state s0
–Actions A
–Transition probabilities p(s'|s,a) (or T(s,a,s'))
–Rewards R(s,a,s') (and discount γ)
MDP quantities:
–Policy = choice of action for each (MAX) state
–Utility (or return) = sum of discounted rewards
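For reference, the return mentioned in the last bullet is the discounted sum of rewards along a trajectory (standard definition, added here since the slide only names it):

\[
U([s_0, a_0, s_1, a_1, s_2, \ldots]) \;=\; \sum_{t \ge 0} \gamma^{t}\, R(s_t, a_t, s_{t+1})
\]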

Optimal Utilities 3
The value of a state s:
–V*(s) = expected utility starting in s and acting optimally
The value of a Q-state (s,a):
–Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
The optimal policy:
–π*(s) = optimal action from state s
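These three quantities are tied together by the standard optimality relations (added here for reference; the slide gives the definitions in words only):

\[
V^*(s) = \max_a Q^*(s,a), \qquad
Q^*(s,a) = \sum_{s'} T(s,a,s')\bigl[R(s,a,s') + \gamma\, V^*(s')\bigr], \qquad
\pi^*(s) = \arg\max_a Q^*(s,a)
\]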

Solving MDPs 4
Value iteration
–Start with V1(s) = 0
–Given Vi, calculate the depth-(i+1) values for all states: Vi+1(s) ← maxa Σs' T(s,a,s')[R(s,a,s') + γVi(s')]
–Repeat until convergence
–Use Vi as the evaluation function when computing Vi+1
Policy iteration
–Step 1: policy evaluation: calculate utilities for some fixed policy
–Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values
–Repeat until the policy converges
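To make the value-iteration loop concrete, here is a minimal Python sketch on a tiny made-up MDP. The states, actions, transition table, rewards, and γ = 0.9 are all illustrative placeholders, not the lecture's or the project's MDP.

# Minimal value-iteration sketch on a tiny, made-up MDP.
# T[(s, a)] is a list of (next_state, probability); R[(s, a, s2)] is the reward.
GAMMA = 0.9

STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"):   [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"):   [("A", 1.0)],
}
R = {
    ("A", "stay", "A"): 0.0,
    ("A", "go", "B"): 1.0,
    ("A", "go", "A"): 0.0,
    ("B", "stay", "B"): 2.0,
    ("B", "go", "A"): 0.0,
}

def value_iteration(epsilon=1e-6):
    V = {s: 0.0 for s in STATES}          # start with V1(s) = 0
    while True:
        newV = {}
        for s in STATES:
            # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_i(s') ]
            newV[s] = max(
                sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])
                for a in ACTIONS
            )
        if max(abs(newV[s] - V[s]) for s in STATES) < epsilon:   # repeat until convergence
            return newV
        V = newV

print(value_iteration())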

Reinforcement learning 5
Don't know T and/or R, but can observe R
–Learn by doing
–Can have multiple episodes (trials)

The Story So Far: MDPs and RL 6
Things we know how to do, and the corresponding techniques:
If we know the MDP
–Compute V*, Q*, π* exactly (computation: value and policy iteration)
–Evaluate a fixed policy π (computation: policy evaluation)
If we don't know the MDP
–If we can estimate the MDP, then solve it (model-based RL)
–We can estimate V for a fixed policy π (model-free RL: sampling)
–We can estimate Q*(s,a) for the optimal policy while executing an exploration policy (model-free RL: Q-learning)

Model-Free Learning 7
Model-free (temporal difference) learning
–Experience the world through episodes (s,a,r,s',a',r',s'',a'',r'',s''',…)
–Update estimates after each transition (s,a,r,s')
–Over time, the updates will mimic Bellman updates

Detour: Q-Value Iteration 8
We'd like to do Q-value updates to each Q-state:
–But we can't compute this update without knowing T and R
Instead, compute the average as we go
–Receive a sample transition (s,a,r,s')
–This sample suggests a new estimate of Q(s,a) (see the updates written out below)
–But we want to merge the new observation with the old ones
–So keep a running average
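The updates this slide refers to, written out in their standard form (γ is the discount, α the learning rate):

Q-value iteration (requires T and R):
\[
Q_{i+1}(s,a) \;\leftarrow\; \sum_{s'} T(s,a,s')\bigl[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\bigr]
\]
What one sample transition (s,a,r,s') suggests:
\[
\text{sample} = r + \gamma \max_{a'} Q(s',a')
\]
Running average (the Q-learning update):
\[
Q(s,a) \;\leftarrow\; (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sample}
\]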

Q-Learning Properties 9
Q-learning will converge to the optimal policy
–If you explore enough (i.e., visit each Q-state many times)
–If you make the learning rate small enough
–It basically doesn't matter how you select actions (!)
Off-policy learning: Q-learning learns the optimal Q-values, not the values of the policy you are following

Q-Learning 10
Q-learning produces tables of Q-values:

Exploration / Exploitation 11
Random actions (ε-greedy)
–Every time step, flip a coin
–With probability ε, act randomly
–With probability 1-ε, act according to the current policy
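A minimal sketch of ε-greedy action selection combined with the tabular Q-learning update from the previous slides. The action set, the parameter values, and the environment interface (env.reset(), env.step()) are hypothetical placeholders, not the project's API.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.05   # illustrative values
ACTIONS = ["north", "south", "east", "west"]

Q = defaultdict(float)                   # Q[(state, action)], defaults to 0

def epsilon_greedy(state):
    # With probability epsilon act randomly; otherwise act greedily w.r.t. Q.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s2):
    # Running-average (temporal-difference) update toward the sample.
    sample = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample

# Hypothetical training loop: `env` is assumed to expose reset()/step().
# for episode in range(num_episodes):
#     s, done = env.reset(), False
#     while not done:
#         a = epsilon_greedy(s)
#         s2, r, done = env.step(a)
#         q_update(s, a, r, s2)
#         s = s2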

Today: Q-Learning with state abstraction 12
In realistic situations, we cannot possibly learn about every single state!
–Too many states to visit them all in training
–Too many states to hold the Q-tables in memory
Instead, we want to generalize:
–Learn about a small number of training states from experience
–Generalize that experience to new, similar states
–This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman 13
Let's say we discover through experience that this state (a board position pictured on the slide) is bad.
In naive Q-learning, we know nothing about this other pictured state or its Q-states.
Or even this one!

Feature-Based Representations 14
Solution: describe a state using a vector of features (properties)
–Features are functions from states to real numbers (often 0/1) that capture important properties of the state
–Example features: distance to closest ghost, distance to closest dot, number of ghosts, 1/(distance to closest dot)², is Pacman in a tunnel? (0/1), …, is it the exact state on this slide?
–Can also describe a Q-state (s,a) with features (e.g., the action moves closer to food)
Similar to an evaluation function

Linear Feature Functions 15
Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
  Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)
Advantage: our experience is summed up in a few powerful numbers
Disadvantage: states may share features but actually be very different in value!
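A small sketch of how such a linear Q-function might be represented in Python. The feature extractor, feature names, and weights are made-up placeholders, not the features used in the project.

# Linear Q-function: Q(s, a) = sum_i w_i * f_i(s, a).
# `weights` and `extract_features` are illustrative stand-ins.
weights = {"dist-to-dot": 4.0, "dist-to-ghost": -1.0, "bias": 0.0}

def extract_features(state, action):
    # Placeholder: a real extractor would inspect the game state.
    # Returns a dict mapping feature names to real values.
    return {"dist-to-dot": 0.5, "dist-to-ghost": 1.0, "bias": 1.0}

def q_value(state, action):
    features = extract_features(state, action)
    return sum(weights[name] * value for name, value in features.items())

print(q_value(state=None, action="north"))   # 4.0*0.5 - 1.0*1.0 + 0.0 = 1.0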

Function Approximation 16
Q-learning with linear Q-functions, given a transition (s,a,r,s'):
–update both the exact Q-value and the feature weights (the exact and approximate updates are written out below)
Intuitive interpretation:
–Adjust the weights of active features
–E.g., if something unexpectedly bad happens, disprefer all states with that state's features
Formal justification: online least squares
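The two updates labeled "Exact Q's" and "Approximate Q's" on this slide are, in their standard form:

\[
\text{difference} = \bigl[r + \gamma \max_{a'} Q(s',a')\bigr] - Q(s,a)
\]
Exact Q's:
\[
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\,[\text{difference}]
\]
Approximate Q's (adjust each weight):
\[
w_i \;\leftarrow\; w_i + \alpha\,[\text{difference}]\; f_i(s,a)
\]
The Q-Pacman example on the next slide applies the approximate update.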

Example: Q-Pacman 17
Initial Q-function: Q(s,a) = 4.0·fDOT(s,a) - 1.0·fGST(s,a)
Observed transition: from state s, Pacman moves NORTH, lands in s', and receives r = R(s,a,s') = -500
Features of this Q-state: fDOT(s,NORTH) = 0.5, fGST(s,NORTH) = 1.0, so Q(s,NORTH) = +1
difference = [r + γ·maxa' Q(s',a')] - Q(s,a) = [-500 + 0] - (+1) = -501
Weight updates (with learning rate α): wDOT ← 4.0 + α·[-501]·0.5, wGST ← -1.0 + α·[-501]·1.0
Updated Q-function: Q(s,a) = 3.0·fDOT(s,a) - 3.0·fGST(s,a)
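As a sanity check on the arithmetic, a tiny script reproducing the update above. The learning rate α = 0.004 is an assumption chosen to match the resulting weights on the slide (the slide's exact α was not captured in the transcript).

# Reproduce the Q-Pacman weight update numerically.
alpha = 0.004          # assumed; consistent with the final weights on the slide
gamma = 0.99           # gamma does not matter here since max_a' Q(s', a') = 0

w = {"DOT": 4.0, "GST": -1.0}
f = {"DOT": 0.5, "GST": 1.0}                  # features of (s, NORTH)

q_sa = sum(w[k] * f[k] for k in w)            # 4.0*0.5 - 1.0*1.0 = +1.0
r = -500.0
max_q_next = 0.0                              # the slide's difference of -501 implies this is 0
difference = (r + gamma * max_q_next) - q_sa  # -501.0

for k in w:                                   # w_i <- w_i + alpha * difference * f_i
    w[k] += alpha * difference * f[k]

print(q_sa, difference, w)   # 1.0 -501.0 {'DOT': ≈3.0, 'GST': ≈-3.0}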

Linear Regression 18

Ordinary Least Squares (OLS) 19
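These two slides showed figures only. As a quick illustration of ordinary least squares, which the Function Approximation slide cites as the formal justification, here is a minimal numpy sketch on made-up data; the data, features, and seed are all illustrative.

import numpy as np

# Made-up 1-D data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)

# Ordinary least squares: choose w to minimize sum_i (y_i - w . f(x_i))^2,
# here with features f(x) = [x, 1] (slope and intercept).
F = np.column_stack([x, np.ones_like(x)])
w, residuals, rank, _ = np.linalg.lstsq(F, y, rcond=None)
print(w)   # approximately [2.0, 1.0]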

Minimizing Error 20
Imagine we had only one point x with features f(x): minimizing the squared prediction error by gradient descent gives the weight update written out below.
Approximate Q update explained: it is this same update with "target" r + γ·maxa' Q(s',a') in place of y and "prediction" Q(s,a) in place of the current estimate.
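The derivation on this slide is the standard online least-squares argument; written out (with y the observed target and α the step size):

\[
\text{error}(w) = \tfrac{1}{2}\Bigl(y - \sum_k w_k f_k(x)\Bigr)^2, \qquad
\frac{\partial\, \text{error}(w)}{\partial w_m} = -\Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x)
\]
\[
w_m \;\leftarrow\; w_m + \alpha \Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x)
\]
With the "target" \(y = r + \gamma \max_{a'} Q(s',a')\) and the "prediction" \(\sum_k w_k f_k(x) = Q(s,a)\), this is exactly the approximate Q update from the Function Approximation slide:
\[
w_m \;\leftarrow\; w_m + \alpha \bigl[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,\bigr]\, f_m(s,a)
\]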

How many features should we use? 21
As many as possible?
–computational burden
–overfitting
Feature selection is important
–requires domain expertise

Overfitting 22

Figure from The Elements of Statistical Learning.

Overview of Project 3 24
MDPs
–Q1: value iteration
–Q2: find parameters that lead to a certain optimal policy
–Q3: similar to Q2
Q-learning
–Q4: implement the Q-learning algorithm
–Q5: implement ε-greedy action selection
–Q6: try the algorithm
Approximate Q-learning and state abstraction
–Q7: Pacman
–Q8 (bonus): design and implement a state-abstraction Q-learning algorithm
Hints
–make your implementation general
–try to use class methods as much as possible