Regularization and Feature Selection in Least-Squares Temporal Difference Learning. J. Zico Kolter and Andrew Y. Ng, Computer Science Department, Stanford University.

Regularization and Feature Selection in Least-Squares Temporal Difference Learning. J. Zico Kolter and Andrew Y. Ng, Computer Science Department, Stanford University. June 16th, ICML 2009.

Outline
RL with (linear) function approximation
–Least-squares temporal difference (LSTD) algorithms are very effective in practice
–But, when the number of features is large, LSTD can be expensive and can over-fit to the training data
This work: a method for feature selection in LSTD (via L1 regularization)
–Introduce the notion of L1-regularized TD fixed points, and develop an efficient algorithm

Outline: Feature Selection Methods
–Supervised learning: greedy methods; convex (L1) methods
–Reinforcement learning: greedy methods (e.g., Parr et al., 2007); convex (L1) methods (this paper)

RL with Least-Squares Temporal Difference

Problem Setup
Markov chain M = (S, R, P, γ)
–Set of states S
–Reward function R(s)
–Transition probabilities P(s'|s)
–Discount factor γ
Want to compute the value function for the Markov chain. But the problem is hard because:
1) We don't know the true state transitions / rewards (we only have access to samples)
2) The state space is too large to represent the value function explicitly
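For reference, the quantity to be computed is the standard discounted value function of the chain; this notation is assumed here, since the slide's own formula is not reproduced in the transcript:

```latex
V(s) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s \right],
\qquad \text{equivalently} \qquad V \;=\; R + \gamma P V .
```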

TD Algorithms
The temporal difference (TD) family of algorithms (Sutton, 1988) addresses this problem setting.
In particular, we focus on Least-Squares Temporal Difference (LSTD) algorithms (Bradtke and Barto, 1996; Boyan, 1999; Lagoudakis and Parr, 2003)
–work well in practice, make efficient use of data

Brief LSTD Overview
Represent the value function using a linear approximation: a parameter vector w applied to state features φ(s).
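In symbols, with w the parameter vector and φ(s) the state features named on the slide (the displayed equation itself is not in the transcript, so this is the standard linear form):

```latex
\hat{V}(s) \;=\; w^{\top} \phi(s) \;=\; \sum_{j=1}^{k} w_j \, \phi_j(s)
```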

Brief LSTD Overview
TD methods seek parameters w that satisfy a fixed-point equation, in which w is the optimization variable, Φ is the matrix of all state features, R is the vector of all rewards, and P is the matrix of transition probabilities; the equation is also sometimes written in an equivalent form.
LSTD finds a w that approximately satisfies this equation using only samples from the MDP (this gives a closed-form expression for the optimal w).
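The slide's equations appear as images and are not in the transcript; the standard TD fixed point consistent with the symbols above, and with the LSTD closed form mentioned here, is:

```latex
w \;=\; f(w), \qquad
f(w) \;=\; \operatorname*{arg\,min}_{u} \;\bigl\| \Phi u - (R + \gamma P \Phi w) \bigr\|_2^2
```

Equivalently, this is the projected Bellman equation $\Phi w = \Pi\,(R + \gamma P \Phi w)$ with projection $\Pi = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top$; solving for $w$ gives the closed form $w = \bigl(\Phi^\top (\Phi - \gamma P \Phi)\bigr)^{-1} \Phi^\top R$, which LSTD estimates from samples.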

Problems with LSTD
Requires storing/inverting a k x k matrix
–Can be extremely slow for large k
–In practice, this often means the practitioner puts great effort into picking a few "good" features
For many features / few samples, LSTD can over-fit to the training data
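As a concrete illustration of the cost, here is a minimal sketch of sample-based LSTD (standard textbook form, not code from the paper), showing the k x k matrix that has to be accumulated and solved:

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """Sample-based LSTD.
    phi[i], phi_next[i]: feature vectors of sampled state s_i and its successor s_i'.
    rewards[i]: observed reward."""
    k = phi.shape[1]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for f, f_next, r in zip(phi, phi_next, rewards):
        A += np.outer(f, f - gamma * f_next)   # k x k matrix: costly to form/invert for large k
        b += r * f
    return np.linalg.solve(A, b)               # closed-form approximate TD fixed point
```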

Regularization and Feature Selection for LSTD

Regularized LSTD
Introduce a regularization term into the LSTD fixed-point equation.
In particular, focus on L1 regularization
–Encourages sparsity in the feature weights (i.e., feature selection)
–Avoids over-fitting to the training samples
–Avoids storing/inverting the full k x k matrix
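Written out, the L1-regularized fixed point has the following form (a reconstruction consistent with the paper's description, since the slide's own equation is not preserved in the transcript), where β controls the amount of regularization:

```latex
w \;=\; \operatorname*{arg\,min}_{u} \;\bigl\| \Phi u - (R + \gamma P \Phi w) \bigr\|_2^2 \;+\; \beta \, \| u \|_1
```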

Regularized LSTD Solution
Unfortunately, for L1-regularized LSTD
–There is no closed-form solution for the optimal w
–The optimal w cannot even be expressed as the solution to a convex optimization problem
Fortunately, it can be found efficiently using an algorithm similar to Least Angle Regression (LARS) (Efron et al., 2004)

LARS-TD Algorithm
Intuition of our algorithm (LARS-TD):
–Express the L1-regularized fixed point in terms of optimality conditions for a convex problem
–Then, beginning at the fully regularized solution (w = 0), proceed down the regularization path (piecewise-linear adjustments to w, which can be computed analytically)
–Stop when we reach the desired amount of regularization
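The sketch below is an illustrative, simplified homotopy in the spirit of LARS-TD, not the paper's pseudocode: the function and variable names are ours, it works directly on state features (no actions or LSPI machinery), and it omits numerical safeguards.

```python
import numpy as np

def lars_td_sketch(phi, phi_next, rewards, gamma, beta_target):
    """Illustrative homotopy for the L1-regularized TD fixed point.
    phi, phi_next: (n, k) arrays of features for sampled states s_i and successors s_i'.
    rewards: (n,) array of sampled rewards."""
    n, k = phi.shape
    A = phi.T @ (phi - gamma * phi_next)      # sample estimate of Phi^T (Phi - gamma P Phi)
    b = phi.T @ rewards                       # sample estimate of Phi^T R

    w = np.zeros(k)
    c = b.copy()                              # correlations c(w) = b - A w (here w = 0)
    beta = float(np.max(np.abs(c)))           # start fully regularized: w = 0 is optimal
    active = [int(np.argmax(np.abs(c)))]

    def safe_div(num, den):
        return num / den if abs(den) > 1e-12 else np.inf

    while beta > beta_target + 1e-12:
        I = np.array(active)
        s = np.sign(c[I])
        # Move active weights so all active correlations shrink together at unit rate.
        dw = np.linalg.solve(A[np.ix_(I, I)], s)
        dc = A[:, I] @ dw                     # correlations change by -dc per unit step

        # Step until an inactive correlation catches up, an active weight hits zero,
        # or the target regularization level is reached, whichever comes first.
        step, event = beta - beta_target, None
        for j in range(k):
            if j in active:
                continue
            for a in (safe_div(c[j] - beta, dc[j] - 1.0),
                      safe_div(c[j] + beta, dc[j] + 1.0)):
                if 1e-12 < a < step:
                    step, event = a, ('add', j)
        for idx, i in enumerate(active):
            a = safe_div(-w[i], dw[idx])
            if 1e-12 < a < step:
                step, event = a, ('remove', i)

        w[I] += step * dw
        c -= step * dc
        beta -= step
        if event is not None:
            if event[0] == 'add':
                active.append(event[1])
            else:
                active.remove(event[1])
                w[event[1]] = 0.0             # weight crossed zero; drop it from the active set
    return w
```

Each pass moves the active weights along one piecewise-linear segment of the regularization path until either an inactive feature's correlation reaches the current regularization level (it enters the active set) or an active weight crosses zero (it leaves), matching the intuition above.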

Theoretical Guarantee
Theorem: Under certain conditions (similar to those required to show convergence of ordinary TD), the L1-regularized fixed point exists and is unique, and the LARS-TD algorithm is guaranteed to find this fixed point.

Computational Complexity
The LARS-TD algorithm has computational complexity of approximately O(k p^3)
–k = total number of features
–p = number of non-zero features (p << k)
Importantly, the algorithm is linear in the total number of features

Experimental Results

Chain Domain
20-state chain domain (Lagoudakis and Parr, 2003)
–Twenty states, two actions; use LARS-TD for LSPI-style policy iteration
–Five "relevant" features: RBFs
–Varying number of irrelevant Gaussian noise features
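A hedged sketch of how such a feature matrix could be built; the exact RBF centers, bandwidths, and noise distribution used in the experiments are assumptions here, not taken from the slides:

```python
import numpy as np

def chain_features(states, n_noise, rng):
    """Five RBF features over a 20-state chain plus irrelevant Gaussian noise features.
    states: array of integer states in {1, ..., 20}."""
    centers = np.linspace(1, 20, 5)              # assumed evenly spaced RBF centers
    width = 4.0                                  # assumed bandwidth
    rbf = np.exp(-((states[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    noise = rng.standard_normal((len(states), n_noise))   # irrelevant noise features
    return np.hstack([rbf, noise])

# e.g., features for all 20 states with 1000 irrelevant features
phi = chain_features(np.arange(1, 21), n_noise=1000, rng=np.random.default_rng(0))
```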

Chain – 1000 Irrelevant Features

Chain – 800 Samples

Mountain Car Domain
Classic mountain car domain
–500 training samples from 50 episodes
–1365 basis functions (automatically generated RBFs with many different bandwidth parameters)

Algorithm:           LARS-TD    LSTD
Success %:           100%       0%
Time per iteration:  1.20 s     3.42 s

Related Work
RL feature selection / generation: (Menache et al., 2005), (Keller et al., 2006), (Parr et al., 2007), (Loth et al., 2007), (Parr et al., 2008)
Regularization: (Farahmand et al., 2009)
Kernel selection: (Jung and Polani, 2006), (Xu et al., 2007)

Summary
LSTD is able to learn a value-function approximation using only samples from the MDP, but it can be computationally expensive and/or over-fit to the data.
We present a feature selection framework for LSTD (using L1 regularization)
–Encourages sparse solutions, prevents over-fitting, computationally efficient

Thank you! Extended paper (with full proofs) available at: