Kernelized Value Function Approximation for Reinforcement Learning. Gavin Taylor and Ronald Parr, Duke University.

Overview
Given a kernel k(s,s') and training data (s,r,s'), (s,r,s'), (s,r,s'), ..., there are two routes to a kernelized value function V = Kw: solve for a kernelized model (as in GPRL) and then solve for the value function given that model, or solve for the value directly using KLSTD or GPTD.

Overview - Contributions
– Construct a new model-based VFA
– Equate the novel VFA with previous work
– Decompose the Bellman error into reward error and transition error
– Use the decomposition to understand VFA
(Diagram: Samples → Model → VFA; the Bellman error combines a reward error and a transition error.)

Outline: Motivation, Notation, and Framework; Kernel-Based Models (Model-Based VFA, Interpretation of Previous Work); Bellman Error Decomposition; Experimental Results and Conclusions

Markov Reward Processes
– M = (S, P, R, γ)
– Value: V(s) = expected, discounted sum of rewards starting from state s
– Bellman equation: V(s) = R(s) + γ Σ_s' P(s'|s) V(s')
– Bellman equation in matrix notation: V = R + γPV
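A minimal sketch (not from the talk) of solving the matrix Bellman equation exactly for a small, made-up Markov reward process; the transition matrix, rewards, and discount are illustrative.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2, 0.0],   # row-stochastic transition matrix
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
R = np.array([0.0, 0.0, 1.0])    # expected reward in each state

# V = R + gamma * P V  =>  V = (I - gamma * P)^{-1} R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)                         # expected, discounted sum of rewards from each state
```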

Kernels
– Properties: symmetric function between two points, k(s,s') = k(s',s); positive semidefinite (PSD) kernel matrix K
– Uses: dot product in a high-dimensional space (the kernel trick); gain expressiveness
– Risks: overfitting; high computational cost
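As an illustration only (the kernel choice and bandwidth are assumptions, not from the talk), the following sketch builds a Gaussian kernel matrix on randomly drawn states and checks the two properties above.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

states = np.random.rand(20, 2)   # 20 randomly drawn 2-D states
K = np.array([[gaussian_kernel(s, t) for t in states] for s in states])

print(np.allclose(K, K.T))                    # symmetry: k(s, s') = k(s', s)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # PSD up to numerical error
```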

Outline: Motivation, Notation, and Framework; Kernel-Based Models (Model-Based VFA, Interpretation of Previous Work); Bellman Error Decomposition; Experimental Results and Conclusions

Kernelized Regression
Apply the kernel trick to least-squares regression: f(x) = k(x)^T (K + Λ)^{-1} t, where
– t: target values
– K: kernel matrix, where K_ij = k(x_i, x_j)
– k(x): column vector, where k(x)_i = k(x_i, x)
– Λ: regularization matrix
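A minimal sketch of this kernelized regression, assuming a Gaussian kernel and a diagonal regularization matrix Λ = λI; the data and parameters are made up.

```python
import numpy as np

def k(x, y, bw=0.2):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bw ** 2))

X = np.random.rand(30, 1)                  # training inputs
t = np.sin(2 * np.pi * X[:, 0])            # target values
K = np.array([[k(a, b) for b in X] for a in X])
Lam = 1e-3 * np.eye(len(X))                # regularization matrix (diagonal here)

alpha = np.linalg.solve(K + Lam, t)        # (K + Lambda)^{-1} t

def predict(x):
    kx = np.array([k(xi, x) for xi in X])  # the column vector k(x)
    return kx @ alpha                      # f(x) = k(x)^T (K + Lambda)^{-1} t

print(predict(np.array([0.25])))           # roughly sin(2*pi*0.25) = 1
```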

Kernel-Based Models
– Approximate reward model: fit the sampled rewards with kernelized regression
– Approximate transition model: we want to predict k(s') (not s' itself), so construct the matrix K', where K'_ij = k(s'_i, s_j), and fit it with kernelized regression as well

Model-based Value Function
– Unregularized: V = K(K - γK')^{-1} r at the sampled states
– Regularized: V = Kw with w = (I - γ(K + Λ_P)^{-1}K')^{-1} (K + Λ_R)^{-1} r
– Whole state space: V(s) = k(s)^T w
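A self-contained sketch, based on my reading of these slides rather than the authors' code, that builds the kernel-based model from the previous slide (reward weights and the next-kernel matrix K') and computes the value-function weights above; the chain domain, kernel bandwidth, and regularizers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, bw = 50, 0.95, 0.1

def kernel(A, B):                     # Gaussian kernel matrix between two 1-D state sets
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))

s  = rng.random(n)                                       # sampled states on [0, 1]
s2 = np.clip(s + 0.05 * rng.standard_normal(n), 0, 1)    # sampled next states (random walk)
r  = (s > 0.9).astype(float)                             # sampled rewards near the right end

K      = kernel(s, s)                 # K_{ij}  = k(s_i, s_j)
K_next = kernel(s2, s)                # K'_{ij} = k(s'_i, s_j)
Lam_R  = 1e-2 * np.eye(n)             # reward regularization matrix
Lam_P  = 1e-2 * np.eye(n)             # transition regularization matrix

# regularized weights; with Lam_R = Lam_P = 0 this reduces to w = (K - gamma K')^{-1} r
w = np.linalg.solve(np.eye(n) - gamma * np.linalg.solve(K + Lam_P, K_next),
                    np.linalg.solve(K + Lam_R, r))
V_samples = K @ w                     # value estimates at the sampled states

def value(x):                         # whole state space: V(s) = k(s)^T w
    return kernel(np.atleast_1d(float(x)), s) @ w

print(value(0.95), value(0.1))        # should be high near the reward, low far from it
```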

Previous Work
– Kernel Least-Squares Temporal Difference Learning (KLSTD) [Xu et al., 2005]: rederives LSTD, replacing dot products with kernels; no regularization
– Gaussian Process Temporal Difference Learning (GPTD) [Engel et al., 2005]: models the value directly with a GP
– Gaussian Processes in Reinforcement Learning (GPRL) [Rasmussen and Kuss, 2004]: models transitions and value with GPs; deterministic reward

Equivalency
The equivalency table lists, for each method (KLSTD, GPTD, GPRL, and the model-based solution of T&P '09), its value function and its model-based equivalent. KLSTD matches the unregularized model-based solution, while GPTD and GPRL correspond to particular choices of the regularization matrices: GPTD through its noise parameter, GPRL through its regularization parameter.

Outline: Motivation, Notation, and Framework; Kernel-Based Models (Model-Based VFA, Interpretation of Previous Work); Bellman Error Decomposition; Experimental Results and Conclusions

Model Error
– Error in reward approximation: Δ_R = r - r̂, the sampled rewards minus the kernelized reward approximation
– Error in transition approximation: Δ_K' = E[K'] - K̂', where E[K'] holds the expected next kernel values and K̂' the approximate next kernel values

Bellman Error
BE = Δ_R + γ Δ_K' w: the Bellman error of the kernelized value function is a linear combination of the reward error and the transition error.
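The following self-contained sketch checks this decomposition numerically on a small finite MRP with a known model: it builds the kernelized model-based value function, computes its true Bellman error, and verifies that it equals Δ_R + γ Δ_K' w. The domain, kernel, and regularizers are illustrative, and the construction follows my reading of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9

P = rng.random((n, n))                  # known row-stochastic transition matrix
P /= P.sum(axis=1, keepdims=True)
R = np.zeros(n)                         # reward only in the last state
R[-1] = 1.0

pos = np.arange(n, dtype=float)         # Gaussian kernel over state indices
K  = np.exp(-(pos[:, None] - pos[None, :]) ** 2 / 2.0)
Kp = P @ K                              # expected next kernel values E[k(s')]

Lam_R, Lam_P = 1e-2 * np.eye(n), 1e-2 * np.eye(n)

R_hat  = K @ np.linalg.solve(K + Lam_R, R)    # approximate reward
Kp_hat = K @ np.linalg.solve(K + Lam_P, Kp)   # approximate next kernel values
w = np.linalg.solve(np.eye(n) - gamma * np.linalg.solve(K + Lam_P, Kp),
                    np.linalg.solve(K + Lam_R, R))
V = K @ w                               # kernelized model-based value function

BE      = R + gamma * P @ V - V         # true Bellman error of V
delta_R = R - R_hat                     # reward error
delta_K = Kp - Kp_hat                   # transition error
print(np.allclose(BE, delta_R + gamma * delta_K @ w))   # True
```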

Outline: Motivation, Notation, and Framework; Kernel-Based Models (Model-Based VFA, Interpretation of Previous Work); Bellman Error Decomposition; Experimental Results and Conclusions

Experiments
– A version of the two-room problem [Mahadevan & Maggioni, 2006] (figure: two-room grid world with the rewarding region marked)
– Use the Bellman error decomposition to tune the regularization parameters
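A sketch of how the decomposition might be used to tune the two regularizers separately, in the spirit of this slide and the later "Don't you have to know the model?" slide: the reward regularizer is scored against the true expected rewards, and the transition regularizer against an independent draw of next states standing in for cross-validation. The domain, kernel, noise levels, and grid are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, bw = 40, 0.1
s = np.sort(rng.random(n))                         # sampled 1-D states

def kernel(A, B):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))

K = kernel(s, s)
R_true = (s > 0.9).astype(float)                   # true expected reward
r_obs  = R_true + 0.2 * rng.standard_normal(n)     # noisy observed rewards

def next_states():                                 # noisy random-walk transitions
    return np.clip(s + 0.05 * rng.standard_normal(n), 0, 1)

Kp_train = kernel(next_states(), s)                # K'_{ij} = k(s'_i, s_j), training draw
Kp_val   = kernel(next_states(), s)                # independent draw used as a validation proxy

def reward_error(lam):                             # || R_true - R_hat(lam) ||
    return np.linalg.norm(R_true - K @ np.linalg.solve(K + lam * np.eye(n), r_obs))

def transition_error(lam):                         # || K'_val - K'_hat(lam) ||
    return np.linalg.norm(Kp_val - K @ np.linalg.solve(K + lam * np.eye(n), Kp_train))

grid = [10.0 ** e for e in range(-6, 2)]
print(min(grid, key=reward_error), min(grid, key=transition_error))
```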

Experiments

Conclusion
– A novel, model-based view of kernelized RL built around kernel regression
– Previous work differs from the model-based view only in its approach to regularization
– The Bellman error can be decomposed into transition and reward error
– Transition and reward error can be used to tune parameters

Thank you!

What about policy improvement?
– Wrap policy iteration around kernelized VFA (example: KLSPI)
– The Bellman error decomposition will be policy dependent
– The choice of regularization parameters may be policy dependent
– Our results do not apply to SARSA variants of kernelized RL, e.g., GPSARSA

What's left?
– Kernel selection: selecting the kernel itself (not just tuning its parameters); varying kernel parameters across states; combining kernels (see Kolter & Ng '09)
– Computational costs in large problems: K grows with the number of samples (a #samples x #samples matrix); inverting K is expensive; the role of sparsification and its interaction with regularization

Comparing model-based approaches
– Transition model: GPRL models s' as a GP; T&P approximate k(s') given k(s)
– Reward model: GPRL assumes a deterministic reward; T&P approximate the reward with regularized, kernelized regression

Don't you have to know the model?
– For our experiments and graphs: the reward and transition errors were calculated with the true R and K'
– In practice: cross-validation could be used to tune the parameters to minimize the reward and transition errors

Why is the GPTD regularization term asymmetric?
– GPTD is equivalent to T&P for a particular (asymmetric) choice of the regularization matrices
– This choice can be viewed as propagating the regularizer through the transition model
– Is this a good idea? Our contribution: tools to evaluate this question

What about Variances?
– Variances can play an important role in Bayesian interpretations of kernelized RL: they can guide exploration and ground the regularization parameters
– Our analysis focuses on the mean; variances are a valid topic for future work

Does this apply to the recent work of Farahmand et al.?
– Not directly
– All methods assume (s, r, s') data
– Farahmand et al. include next states (s'') in their kernel, i.e., k(s'', s) and k(s'', s')
– Previous work, and ours, includes only s' in the kernel: k(s', s)

How is This Different from Parr et al. ICML 2008?
– Parr et al. consider linear fixed-point solutions, not kernelized methods
– The equivalence between linear fixed-point methods was already fairly well understood
– Our contributions: a unifying view of previous kernel-based methods, and an extension of the equivalence between model-based and direct methods to the kernelized case