Bayesian Reinforcement Learning
Machine Learning RCC, 16th June 2011
Outline
– Introduction to Reinforcement Learning
– Overview of the field
– Model-based BRL
– Model-free RL
References
– ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel
– Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto
Machine Learning
– Unsupervised Learning
– Reinforcement Learning
– Supervised Learning
Definitions
– State
– Action
– Reward
– Policy
– Reward function
Markov Decision Process
[Diagram: agent-environment loop over states x0, x1, …, actions a0, a1, …, and rewards r0, r1, …; the policy chooses each action, the transition probability generates the next state, and the reward function generates each reward]
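The equations behind the diagram did not survive extraction; in the slide's notation, the standard pieces are:

    \[
    a_t \sim \pi(\cdot \mid x_t), \qquad
    x_{t+1} \sim P(\cdot \mid x_t, a_t), \qquad
    r_t = R(x_t, a_t)
    \]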
Value Function
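The value-function formulas are likewise missing from the extracted text; the standard discounted definitions in the same notation are:

    \[
    V^\pi(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, \pi\Big], \qquad
    Q^\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, a_0 = a, \pi\Big]
    \]

with the Bellman equation V^π(x) = R(x, π(x)) + γ Σ_x' P(x' | x, π(x)) V^π(x').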
Optimal Policy
– Assume one optimal action per state
– The optimal value function is unknown; compute it by Value Iteration
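A minimal value-iteration sketch under assumed tabular dynamics (the arrays P and R are illustrative, not from the slides):

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """P: (S, A, S) transition probabilities; R: (S, A) rewards."""
        S, A, _ = P.shape
        V = np.zeros(S)
        while True:
            # Bellman optimality backup: expected return, maximised over actions
            Q = R + gamma * (P @ V)        # shape (S, A)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)         # optimal values and greedy policy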
Reinforcement Learning
– RL problem: solve an MDP when the reward/transition models are unknown
– Basic idea: use samples obtained from the agent's interaction with the environment
Model-Based vs Model-Free RL
– Model-Based: learn a model of the reward/transition dynamics, then derive the value function/policy from it
– Model-Free: directly learn the value function/policy
RL Solutions: Value Function Algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy
Examples: SARSA, Q-learning
RL Solutions: Actor-Critic
– Define a policy structure (the actor)
– Define a value function (the critic)
– Sample state-action-reward transitions
– Update both actor & critic (a one-step sketch follows below)
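A minimal one-step actor-critic sketch, assuming a tabular critic and a softmax policy over preferences (all names are illustrative, not from the slides):

    import numpy as np

    def actor_critic_step(theta, V, s, a, r, s2,
                          alpha_v=0.1, alpha_p=0.01, gamma=0.99):
        """theta: (S, A) softmax policy preferences (actor);
        V: (S,) state values (critic)."""
        # Critic: one-step TD error says how much better or worse
        # the transition was than expected
        delta = r + gamma * V[s2] - V[s]
        V[s] += alpha_v * delta
        # Actor: softmax policy at s, then a log-likelihood gradient step
        pi = np.exp(theta[s] - theta[s].max())
        pi /= pi.sum()
        grad = -pi
        grad[a] += 1.0          # gradient of log pi(a|s) w.r.t. theta[s]
        theta[s] += alpha_p * delta * grad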
RL Solutions: Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy
Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
Online vs Offline
Offline
– Use a simulator
– Policy fixed for each 'episode'
– Updates made at the end of the episode
Online
– Directly interact with the environment
– Learning happens step-by-step
Model-Free Solutions
1. Prediction: estimate V(x) or Q(x,a)
2. Control: extract a policy
On-Policy vs Off-Policy
Monte-Carlo Predictions
[Figure: driving-to-Cambridge example (Leave car park, Get out of city, Motorway, Enter Cambridge), plotting predicted value and reward per stage; MC waits until the episode ends, then updates every visited state toward the actual observed return]
Temporal Difference Predictions
[Figure: same driving example; TD updates each state immediately, toward the current estimate of the next state, without waiting for the episode to end]
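A minimal sketch of the two prediction updates for tabular values with a constant step size (the episode layout and names are assumptions for illustration):

    def mc_update(V, episode, alpha=0.1, gamma=1.0):
        """Monte-Carlo: after the episode ends, move each visited
        state's value toward the actual return from that state.
        episode: list of (state, reward) pairs in visit order."""
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G            # return from this state onward
            V[state] += alpha * (G - V[state])

    def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
        """TD(0): update immediately, toward the bootstrapped target
        reward + gamma * V[next_state]."""
        target = reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])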
Advantages of TD
– Doesn't need a model of the rewards/transitions
– Online and fully incremental
– Proven to converge under standard (Robbins-Monro) step-size conditions
– "Usually" faster than MC methods
From TD to TD(λ)
[Figure: backup diagrams spanning from the one-step TD backup through n-step backups to the full Monte-Carlo backup ending at the terminal state]
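The equations behind the diagrams did not survive extraction; the standard n-step return and λ-return, written in the slide's notation, are:

    \[
    G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(x_{t+n}), \qquad
    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
    \]

so λ = 0 recovers the one-step TD target and λ → 1 the Monte-Carlo return, which is exactly what the diagrams interpolate between.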
SARSA & Q-Learning (TD-Learning)
– SARSA: on-policy; estimates the value function of the current policy
– Q-Learning: off-policy; estimates the value function of the optimal policy
(See the update-rule sketch below.)
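A minimal sketch of the two update rules for a tabular Q (the epsilon_greedy helper and all names are illustrative assumptions):

    import random
    from collections import defaultdict

    Q = defaultdict(float)   # tabular action values, keyed by (state, action)

    def epsilon_greedy(Q, state, actions, eps=0.1):
        # Behaviour policy usable by both algorithms
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
        # On-policy: bootstrap from a2, the action actually taken in s2
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

    def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap from the greedy action in s2,
        # regardless of what the behaviour policy does next
        best = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])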
GP Temporal Difference
[Figure: two panels of scattered data points, presumably illustrating Gaussian-process regression over the value function; detail lost in extraction]
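The slide's content is largely lost; for completeness, the GPTD model of Engel et al. (the third author of the tutorial referenced above) places a GP prior on the value function and treats each observed reward as a noisy linear measurement of it:

    \[
    V \sim \mathcal{GP}\big(0, k(x, x')\big), \qquad
    r_t = V(x_t) - \gamma\, V(x_{t+1}) + N_t
    \]

Conditioning on an observed trajectory then yields a posterior mean and variance for V by standard GP regression.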