Reinforcement Learning (2)

Presentation transcript:

Reinforcement Learning (2) Lirong Xia

Recap: MDPs

Markov decision processes (MDPs):
- States S, with a start state s0
- Actions A
- Transition model p(s'|s,a) (also written T(s,a,s'))
- Reward R(s,a,s'), and discount γ

MDP quantities:
- Policy = choice of action for each (MAX) state
- Utility (or return) = sum of discounted rewards
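
To make these quantities concrete, here is a hypothetical two-state MDP written as plain Python dicts; this representation is only for illustration and is not the format used by the project code.

    # Hypothetical toy MDP in the notation above (illustration only).
    gamma = 0.9
    states = ["A", "B"]

    def actions(s):
        return ["stay", "exit"]

    # Transition model: T[(s, a)] = list of (s_next, p(s_next | s, a))
    T = {
        ("A", "stay"): [("A", 0.9), ("B", 0.1)],
        ("A", "exit"): [("B", 1.0)],
        ("B", "stay"): [("B", 1.0)],
        ("B", "exit"): [("B", 1.0)],
    }

    # Reward: R[(s, a, s_next)] = +1 for exiting from A, 0 otherwise
    R = {(s, a, s2): (1.0 if (s, a) == ("A", "exit") else 0.0)
         for (s, a), outcomes in T.items() for s2, _ in outcomes}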

Optimal Utilities

- The value of a state s: V*(s) = expected utility starting in s and acting optimally
- The value of a Q-state (s,a): Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
- The optimal policy: π*(s) = optimal action from state s
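
For reference, these quantities are tied together by the standard Bellman optimality relations, written here in the same notation as above:

  V*(s)   = max_a Q*(s,a)
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  π*(s)   = argmax_a Q*(s,a)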

Solving MDPs

Value iteration:
- Start with V_1(s) = 0
- Given V_i, compute the depth-(i+1) values for all states:
  V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
- Repeat until convergence
- Use V_i as the evaluation function when computing V_{i+1}

Policy iteration:
- Step 1: policy evaluation: calculate utilities for some fixed policy
- Step 2: policy improvement: update the policy using one-step look-ahead, with the resulting utilities as future values
- Repeat until the policy converges
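
A minimal value-iteration sketch in Python, assuming the dict-based MDP format of the toy example above (an illustration, not the project's API):

    # Value iteration sketch: repeatedly apply the Bellman update above.
    def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
        V = {s: 0.0 for s in states}                  # V_1(s) = 0
        for _ in range(iterations):                   # fixed depth; could also stop on convergence
            V_next = {}
            for s in states:
                V_next[s] = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                    for a in actions(s)
                )
            V = V_next                                # use V_i when computing V_{i+1}
        return V

On the toy MDP sketched earlier, value_iteration(states, actions, T, R) converges to V(A) = 1.0 and V(B) = 0.0.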

Reinforcement learning

- We don't know T and/or R, but we can observe the rewards we receive
- Learn by doing: the agent can have multiple trials

The Story So Far: MDPs and RL

Things we know how to do:
- If we know the MDP:
  - Compute V*, Q*, π* exactly
  - Evaluate a fixed policy π
- If we don't know T and R:
  - If we can estimate the MDP, then solve it
  - We can estimate V for a fixed policy π
  - We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
- Computation: value and policy iteration, policy evaluation
- Model-based RL: estimate the MDP by sampling, then solve it
- Model-free RL: Q-learning

Model-Free Learning

Model-free (temporal difference) learning:
- Experience the world through trials: (s, a, r, s', a', r', s'', a'', r'', s''', ...)
- Update estimates after each transition (s, a, r, s')
- Over time, the updates will mimic Bellman updates

Q-Learning

Q-learning produces tables of Q-values, updated after each observed transition (s, a, r, s'):
  Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
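
A minimal sketch of this update in Python, with the Q-table stored as a dict keyed by (state, action); legal_actions(s) is an assumed helper that lists the actions available in s:

    # One tabular Q-learning update for an observed transition (s, a, r, s_next).
    def q_update(Q, s, a, r, s_next, legal_actions, alpha=0.1, gamma=0.9):
        q_next = max((Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next)),
                     default=0.0)                      # 0 if s_next has no actions (terminal)
        sample = r + gamma * q_next                    # one-sample estimate of the target
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample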

Exploration / Exploitation

Random actions (ε-greedy):
- Every time step, flip a coin
- With probability ε, act randomly
- With probability 1-ε, act according to the current policy
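
A small ε-greedy action-selection sketch, using the same assumed Q-dict and legal_actions helper as above:

    import random

    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily on Q.
    def epsilon_greedy_action(Q, s, legal_actions, epsilon=0.1):
        acts = legal_actions(s)
        if random.random() < epsilon:
            return random.choice(acts)                          # explore
        return max(acts, key=lambda a: Q.get((s, a), 0.0))      # exploit current policy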

Today: Q-Learning with state abstraction

In realistic situations, we cannot possibly learn about every single state:
- Too many states to visit them all in training
- Too many states to hold the Q-tables in memory

Instead, we want to generalize:
- Learn about a small number of training states from experience
- Generalize that experience to new, similar states
- This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman

Suppose we discover through experience that a particular state is bad. In naive Q-learning, we know nothing about other states that are nearly identical to it, or about their Q-states.

Feature-Based Representations

Solution: describe a state using a vector of features (properties).
- Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to the closest ghost
  - Distance to the closest dot
  - Number of ghosts
  - 1 / (distance to closest dot)^2
  - Is Pacman in a tunnel? (0/1)
  - ...etc.
- We can also describe a Q-state (s,a) with features (e.g., "this action moves closer to food")
- This is similar to an evaluation function
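
A hypothetical feature extractor along these lines; the successor_state, closest_ghost_distance, closest_dot_distance, and num_ghosts helpers are assumptions made for illustration, not part of the project code:

    # Hypothetical feature extractor for a Q-state (s, a); all helpers are assumed.
    def extract_features(s, a):
        s_next = successor_state(s, a)            # assumed: state reached by taking a in s
        return {
            "bias": 1.0,
            "dist-to-closest-ghost": float(closest_ghost_distance(s_next)),
            "inv-sq-dist-to-dot": 1.0 / (closest_dot_distance(s_next) ** 2 + 1.0),
            "num-ghosts": float(num_ghosts(s_next)),
        }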

Linear Feature Functions

Using a feature representation, we can write a Q-function for any state using just a few weights:
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
- Advantage: more efficient learning from samples
- Disadvantage: states may share features but actually be very different in value!
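
In code, this Q-function is just a dot product between a weight dict and a feature dict; features(s, a) is any extractor like the hypothetical one sketched above:

    # Linear Q-function sketch: Q(s,a) = sum_i w_i * f_i(s,a).
    def linear_q(weights, features, s, a):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(s, a).items())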

Function Approximation

Q-learning with linear Q-functions, for a transition (s, a, r, s'):
  difference = [ r + γ max_{a'} Q(s',a') ] - Q(s,a)
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  w_i ← w_i + α · difference · f_i(s,a)

Intuitive interpretation:
- Adjust the weights of the active features
- E.g., if something unexpectedly bad happens, disprefer all states with that state's features
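
A sketch of this weight update in Python, reusing the assumed features(s, a) and legal_actions(s) helpers from the earlier sketches:

    # Approximate Q-learning: one weight update for an observed transition (s, a, r, s_next).
    def approx_q_update(weights, features, s, a, r, s_next, legal_actions,
                        alpha=0.01, gamma=0.9):
        def q(state, action):                             # Q(s,a) = sum_i w_i * f_i(s,a)
            return sum(weights.get(n, 0.0) * v
                       for n, v in features(state, action).items())

        q_next = max((q(s_next, a2) for a2 in legal_actions(s_next)), default=0.0)
        difference = (r + gamma * q_next) - q(s, a)       # [target] - [prediction]
        for name, value in features(s, a).items():        # adjust weights of active features
            weights[name] = weights.get(name, 0.0) + alpha * difference * value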

Example: Q-Pacman

Start with Q(s,a) = 4.0 f_DOT(s,a) - 1.0 f_GST(s,a).
- In state s, for a = NORTH: f_DOT(s,NORTH) = 0.5 and f_GST(s,NORTH) = 1.0, so Q(s,NORTH) = +1
- Pacman takes a = NORTH and receives r = R(s,a,s') = -500, with no future value from s'
- difference = [ r + 0 ] - Q(s,NORTH) = -501
- w_DOT ← 4.0 + α[-501](0.5)
- w_GST ← -1.0 + α[-501](1.0)
- New Q-function: Q(s,a) = 3.0 f_DOT(s,a) - 3.0 f_GST(s,a)
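
The arithmetic can be checked in a few lines of Python. The slide does not state the learning rate, but the new weights (3.0 and -3.0) are consistent with α ≈ 0.004, an inferred value rather than one given in the source:

    # Re-deriving the Q-Pacman numbers; alpha is inferred, not stated on the slide.
    alpha, gamma = 0.004, 1.0
    w_dot, w_gst = 4.0, -1.0
    f_dot, f_gst = 0.5, 1.0                       # features of (s, NORTH)
    q_sa = w_dot * f_dot + w_gst * f_gst          # = 1.0
    r, q_next = -500.0, 0.0                       # no future value from s'
    difference = (r + gamma * q_next) - q_sa      # = -501.0
    w_dot += alpha * difference * f_dot           # -> about 3.0
    w_gst += alpha * difference * f_gst           # -> about -3.0
    print(difference, round(w_dot, 2), round(w_gst, 2))   # -501.0 3.0 -3.0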

Linear Regression

Ordinary Least Squares (OLS)
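
For reference, ordinary least squares fits the weights w by minimizing the total squared prediction error over the observed points (a standard formulation, not quoted from the slide):

  total error(w) = Σ_i ( y_i - Σ_k w_k f_k(x_i) )^2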

Minimizing Error

Imagine we had only one point x with features f(x) and target value y:
  error(w) = 1/2 ( y - Σ_k w_k f_k(x) )^2
  ∂error/∂w_m = - ( y - Σ_k w_k f_k(x) ) f_m(x)
  w_m ← w_m + α ( y - Σ_k w_k f_k(x) ) f_m(x)

The approximate Q update is exactly this one-step gradient descent, with [ r + γ max_{a'} Q(s',a') ] playing the role of the "target" and Q(s,a) the "prediction":
  w_m ← w_m + α ( [ r + γ max_{a'} Q(s',a') ] - Q(s,a) ) f_m(s,a)

How many features should we use?

As many as possible? Not necessarily:
- More features mean a greater computational burden
- More features make overfitting more likely

Feature selection is important, and it typically requires domain expertise.

Overfitting

Overview of Project 3

MDPs:
- Q1: value iteration
- Q2: find parameters that lead to a certain optimal policy
- Q3: similar to Q2

Q-learning:
- Q4: implement the Q-learning algorithm
- Q5: implement ε-greedy action selection
- Q6: try out the algorithm

Approximate Q-learning and state abstraction:
- Q7: Pacman

Tips:
- Make your implementation general