Reinforcement learning

Reinforcement learning References: Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004); Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005); Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998).

Reinforcement learning http://www.youtube.com/watch?v=mRpX9DFCdwI http://www.youtube.com/watch?v=VCdxqn0fcnE

Reinforcement learning Demo: "Pavlov" on the Nomad 200 robot and the Nomad 200 simulator (Sridhar Mahadevan, UMass).

Reinforcement learning Control tasks; planning of multiple actions; learning from interaction; objective: maximising reward (i.e. task-specific). (Diagram: a sample trajectory of states s1, s2, …, s9 with actions a1, a2, …, a9 and rewards r1, …, r9, e.g. +3, -1, +50.)

Supervised vs Reinforcement learning Both are machine learning. Supervised: prompt supervision; passive learning (a training dataset is given). Reinforcement: late, indirect reinforcement; active learning (the system takes actions, which are then reinforced).

Reinforcement learning time: t = 0, 1, 2, …; states: s_t ∈ S; actions: a_t ∈ A; reward: r_t ∈ ℝ; policy (strategy): deterministic π: S → A, or stochastic π(s,a), the probability that we choose action a when in state s (infinite horizon).
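
The distinction between the two policy types can be illustrated with a minimal Python sketch (the state and action names here are made up for illustration, not part of the slides):

```python
import random

# Deterministic policy: a lookup table mapping each state to exactly one action.
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: pi(s, a) gives the probability of picking action a in state s.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act(policy, state):
    """Pick an action: directly for a deterministic policy, by a weighted draw otherwise."""
    choice = policy[state]
    if isinstance(choice, str):
        return choice
    actions, probs = zip(*choice.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, "s1"))  # always "left"
print(act(stochastic_policy, "s1"))     # "left" with probability 0.8
```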

process: the sequence s_0, a_0, r_0, s_1, a_1, r_1, …; model of the environment: transition probabilities P(s' | s, a) and reward R(s, a); objective: find a policy which maximises the expected value of the total reward.
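
As a quick illustration of the objective, the discounted total reward of a finite reward sequence can be computed as below (a sketch; the discount factor γ = 0.9 is taken from the car-rental example later in the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([3, -1, 50]))  # 3 + 0.9*(-1) + 0.81*50 = 42.6
```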

Markov assumption → the dynamics of the system can be given by P(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} = s' | s_t, a_t), i.e. the next state depends only on the current state and action.

Markov Decision Processes (MDPs) Stochastic transitions. (Diagram: a small MDP with states 1 and 2 and stochastic transitions; action a1 gives reward r = 0, action a2 gives reward r = 2.)

Markov Decision Process The environment changes according to P and R; the agent takes an action a_t; we are looking for the optimal policy π* which maximises the expected discounted total reward E[Σ_t γ^t r_t].
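
For a small MDP with known P and R, such a policy can be found with value iteration (one of the classical dynamic-programming methods the referenced lectures cover). A minimal sketch on a toy two-state MDP invented for illustration:

```python
# Transition model: P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {"a1": [(1.0, 0, 0.0)], "a2": [(0.5, 0, 2.0), (0.5, 1, 2.0)]},
    1: {"a1": [(1.0, 0, 0.0)], "a2": [(1.0, 1, 1.0)]},
}
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    # Bellman optimality backup: best expected one-step reward plus discounted value.
    V_new = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }
    if max(abs(V_new[s] - V[s]) for s in P) < 1e-8:
        V = V_new
        break
    V = V_new

# Greedy policy extraction from the converged value function.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)
```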

The exploration – exploitation dilemma The k-armed bandit: the agent repeatedly picks one of k arms and receives a reward. (Diagram: observed reward sequences and average rewards per arm, e.g. 0, 0, 5, 10, 35 with average 10; 5, 10, -15, -15, -10 with average -5; -20, 0, 50.) To maximise reward in the long run we first have to explore the world's dynamics, and then we can exploit this knowledge and collect reward.
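
A common way to balance exploration and exploitation is ε-greedy action selection; here is a minimal sketch (the arms' reward distributions are made up for illustration):

```python
import random

true_means = [1.0, -0.5, 2.0]        # hidden mean reward of each arm (illustrative)
counts = [0] * len(true_means)        # number of pulls per arm
estimates = [0.0] * len(true_means)   # sample-average reward estimates
epsilon = 0.1

for _ in range(10000):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))
    else:
        arm = max(range(len(true_means)), key=lambda a: estimates[a])
    reward = random.gauss(true_means[arm], 1.0)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

print(estimates)  # approaches the true means; arm 2 gets pulled most often
```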

Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the numbers of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations, and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and that a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
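
A sketch of how the expected immediate reward for one state-action pair could be set up under these assumptions (a simplification for illustration; the full dynamic-programming solution also needs the transition probabilities over end-of-day car counts):

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) * exp(-lam) / factorial(n)

def expected_rental_reward(cars_available, lam_requests, price=10.0, max_n=30):
    """Expected rental income at one location given the cars on hand in the morning."""
    return price * sum(
        poisson(n, lam_requests) * min(n, cars_available) for n in range(max_n)
    )

# State (5, 8): 5 cars at location 1 (request lambda = 3), 8 at location 2 (lambda = 4);
# action: move 2 cars from location 1 to location 2 overnight at $2 per car.
moved = 2
reward = (
    expected_rental_reward(5 - moved, 3)
    + expected_rental_reward(8 + moved, 4)
    - 2.0 * abs(moved)
)
print(round(reward, 2))
```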

Regression-based reinforcement learning If the number of states and/or actions is high, a tabular representation becomes intractable; continuous state and/or action spaces likewise require approximating the value function, e.g. by regression.
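
One simple instance is Q-learning with a linear function approximator over state features; a minimal sketch of the update rule (the feature vectors and dimensions are assumptions made up for illustration):

```python
import numpy as np

n_features, n_actions = 4, 2
w = np.zeros((n_actions, n_features))   # one weight vector per action
alpha, gamma = 0.1, 0.9

def q(features, action):
    """Approximate action value as a linear function of the state features."""
    return w[action] @ features

def update(features, action, reward, next_features, done):
    """Semi-gradient Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = reward if done else reward + gamma * max(q(next_features, a) for a in range(n_actions))
    td_error = target - q(features, action)
    w[action] += alpha * td_error * features

# Example step with made-up feature vectors:
s = np.array([1.0, 0.0, 0.5, 0.2])
s_next = np.array([0.0, 1.0, 0.3, 0.7])
update(s, action=1, reward=1.0, next_features=s_next, done=False)
print(w)
```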

TD-gammon TD learning with a neural network with a single hidden layer; roughly 1,500,000 games played between variants of the program; achieved the level of the world champion. Backgammon state space: ~10^20, so dynamic programming doesn't work.
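
The core temporal-difference update that TD-Gammon builds on, shown here in its simplest tabular TD(0) form (TD-Gammon itself used TD(λ) with a neural network as the value function):

```python
from collections import defaultdict

V = defaultdict(float)   # state-value estimates, initialised to 0
alpha, gamma = 0.1, 1.0  # step size; no discounting within a game

def td0_update(state, reward, next_state, done):
    """Move V(s) toward the one-step bootstrapped target r + gamma * V(s')."""
    target = reward if done else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

# Example: a terminal transition where the game was won (reward 1).
td0_update(state="some_position", reward=1.0, next_state=None, done=True)
print(V["some_position"])  # 0.1 after a single update
```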

AlphaGo Zero Approximate policy iteration; deep learning (79 layers); about 5 million self-play games. Go state space: ~10^170.