Reinforcement learning


Reinforcement learning Thomas Trappenberg

Three kinds of learning:
1. Supervised learning: a detailed teacher provides the desired output y for a given input x: training set {x, y} → find an appropriate mapping function y = h(x; w) [= W φ(x)]
2. Unsupervised learning: unlabeled samples are provided, from which the system has to figure out good representations: training set {x} → find sparse basis functions b_i so that x = Σ_i c_i b_i
3. Reinforcement learning: delayed feedback from the environment in the form of reward/punishment when reaching state s with action a: reward r(s, a) → find the optimal policy a = π*(s)
Reinforcement learning covers the most general learning circumstances.

Maximize expected Utility www.koerding.com

2. Reinforcement learning
[Figure: grid-world example with a reward of -0.1 in each non-terminal cell; from Russell and Norvig]

Markov Decision Process (MDP)
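The formal definition on this slide is an image that did not survive the transcript; as a standard reconstruction (not necessarily the slide's exact notation), an MDP is the tuple

(S, A, P, r, \gamma), \qquad P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a),

with reward function r(s, a) and discount factor 0 \le \gamma < 1.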

Two important quantities: the policy and the value function. Goal: maximize the total expected payoff (optimal control).
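The defining equations on the slide are images; in standard notation (a reconstruction, not the slide's exact symbols), the policy and the value function are

a = \pi(s), \qquad V^\pi(s) = E\left[ \sum_{t \ge 0} \gamma^t \, r(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right].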

Calculating the value function (dynamic programming). Deterministic policies are assumed to simplify the notation. Bellman equation for a policy π. Solution: analytic or incremental. Richard Bellman (1920-1984).
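The Bellman equation itself appears only as an image; for a fixed deterministic policy \pi its standard form is

V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s'),

which can be solved analytically as a linear system in V^\pi or incrementally by repeated sweeps over the states.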

Remark on different formulations: some authors (such as Sutton and Alpaydin, but not Russell & Norvig) define the value as the reward at the next state plus all the following rewards, instead of starting the sum at the current state.
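The two formulas referred to are images on the slide; written out (a reconstruction of the two common conventions), the remark contrasts

V^\pi(s) = E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s \right] \quad \text{instead of} \quad V^\pi(s) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s \right].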

Policy Iteration
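To make the two alternating steps (policy evaluation, then greedy policy improvement) concrete, here is a minimal sketch in Python; the transition tensor P, reward matrix R, and the toy numbers at the bottom are assumptions for illustration, not material from the slides.

```python
# Minimal policy iteration for a small, fully known MDP.
# P[a, s, s'] = transition probability, R[s, a] = expected reward.
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # start with an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi analytically
        P_pi = P[policy, np.arange(n_states), :]      # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R.T + gamma * P @ V                       # Q[a, s]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy example: 2 states, 2 actions (numbers chosen only for illustration)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```

The evaluation step here uses the analytic (matrix-inversion) solution of the Bellman equation mentioned above; an incremental evaluation sweep would work as well.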

Value Iteration: Bellman equation for the optimal policy.
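Written out (the slide's equation is an image), the Bellman optimality equation and the value-iteration update that follows from it are, in standard form,

V^*(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right], \qquad V_{k+1}(s) \leftarrow \max_a \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \right].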

Solution exists, but:
Environment not known a priori → online learning (TD)
Partial observability of states → POMDP
Curse of dimensionality → model-based RL

POMDP: partially observable MDPs can be reduced to MDPs by considering belief states b:
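The belief update on the slide is only an image; its standard form, after taking action a and receiving observation o, is

b'(s') \propto O(o \mid s', a) \sum_s P(s' \mid s, a) \, b(s),

so the POMDP becomes an ordinary MDP over belief vectors b (here O denotes the observation model, a symbol introduced for this reconstruction).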

What if the environment is not completely known? Online value-function estimation (TD learning): use a Monte Carlo method with bootstrapping. Compare the expected payoff before taking a step with the expected payoff after taking the step, which is the actual reward plus the discounted expected payoff of the next state; the difference between the two is the temporal difference.
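In standard TD(0) notation (the slide's formula is an image), the temporal difference and the resulting online update are

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t,

with learning rate \alpha.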

Online optimal control: Exploitation versus Exploration On-policy TD learning: Sarsa Off-policy TD learning: Q-learning
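A minimal tabular Q-learning sketch, to make the off-policy update and the epsilon-greedy exploration concrete; the env object (with reset() returning a state and step(a) returning next state, reward, and a done flag) and all hyperparameters are assumptions for illustration, not from the slides.

```python
# Tabular Q-learning with epsilon-greedy exploration (illustrative sketch).
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation: epsilon-greedy action choice
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Off-policy update: bootstrap from the greedy value of the next state
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Sarsa, the on-policy variant, would bootstrap from Q[s_next, a_next] for the action actually chosen in the next state rather than from the greedy max.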

Model-based RL: TD(1). Instead of the tabular methods mainly discussed before, use a function approximator with parameters θ and a gradient-descent step (Sutton 1988), for example a neural network with weights θ and the corresponding delta learning rule, updating the weights after an episode of m steps. The only problem is that the feedback r is received only at the end of the episode, so we need to keep a memory (trace) of the sequence.
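The update formula on the slide is an image; in Sutton's 1988 notation (a reconstruction), the episodic delta-rule update with the outcome z received only at the end of the m-step episode is

\Delta\theta = \alpha \sum_{t=1}^{m} \left( z - V(s_t) \right) \nabla_\theta V(s_t).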

Model-based RL: TD(1) … alternative formulation. We can rewrite the end-of-episode error as a sum of successive one-step differences, and putting this into the formula and rearranging the sum gives a per-step update (see the reconstruction below). We still need to keep the cumulative sum of the derivative terms, but otherwise it already looks closer to bootstrapping.
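A reconstruction of the missing formulas (only the prose survives in the transcript): writing V(s_{m+1}) \equiv z, we can express

z - V(s_t) = \sum_{k=t}^{m} \left( V(s_{k+1}) - V(s_k) \right),

and substituting this into the episodic update and rearranging the double sum gives the per-step form

\Delta\theta_t = \alpha \left( V(s_{t+1}) - V(s_t) \right) \sum_{k=1}^{t} \nabla_\theta V(s_k).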

Model-based RL: TD(λ). We now introduce a new algorithm by weighting recent gradients more strongly than those further in the past. This is called the TD(λ) rule. For λ = 1 we recover the TD(1) rule. Also interesting is the other extreme, TD(0), which uses the prediction V(t+1) as the supervision signal for step t. Otherwise this is equivalent to supervised learning and can easily be generalized to networks with hidden layers.
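Written out (the slide's formula is an image), the TD(\lambda) update weights recent gradients by powers of \lambda:

\Delta\theta_t = \alpha \left( V(s_{t+1}) - V(s_t) \right) \sum_{k=1}^{t} \lambda^{t-k} \nabla_\theta V(s_k),

so \lambda = 1 recovers the TD(1) rule above, while \lambda = 0 leaves only \Delta\theta_t = \alpha \left( V(s_{t+1}) - V(s_t) \right) \nabla_\theta V(s_t), i.e. V(s_{t+1}) serves as the supervision signal for step t.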

Free-Energy-Based RL: This can be generalized to Boltzmann machines (Sallans & Hinton 2004) Paul Hollensen: Sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)

Classical Conditioning. Ivan Pavlov (1849-1936), Nobel Prize 1904. Rescorla-Wagner model (1972).
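The Rescorla-Wagner update is not written out in the transcript; its standard form is

\Delta V_i = \alpha_i \beta \left( \lambda - \sum_j V_j \right),

where V_i is the associative strength of stimulus i, \lambda the maximum conditioning the reward supports, and \alpha_i, \beta learning rates; the prediction error (\lambda - \sum_j V_j) plays the same role as the TD error above (note that \lambda here is the Rescorla-Wagner asymptote, not the trace-decay parameter of TD(\lambda)).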

Reward Signals in the Brain (Wolfram Schultz)
[Figure: dopamine-neuron recordings for three conditions: stimulus A with no reward, stimulus B, stimulus A with reward]

Disorders with effects on the dopamine system: Parkinson's disease, Tourette's syndrome, ADHD, drug addiction, schizophrenia (Maia & Frank 2011).

Conclusion and Outlook. Three basic categories of learning:
Supervised: lots of progress through statistical learning theory; kernel machines, graphical models, etc.
Unsupervised: hot research area with some progress; deep temporal learning.
Reinforcement: important topic in animal behavior; model-based RL.