Reinforcement Learning

Reinforcement Learning (Chapter 13). Outline: What is Reinforcement Learning? Q-Learning. Examples.

Machine Learning Categories

What's Reinforcement Learning? An autonomous agent must learn to choose optimal actions in each state so as to achieve its goals. The agent learns how to do this through trial-and-error interactions with its environment.

Example: Learning to ride a bike. Suppose that in the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two possible actions: turn the handlebars right, crashing to the ground (a negative reinforcement), or turn the handlebars left, also crashing to the ground (a negative reinforcement).

Example: Learning to ride a bike. At this point, the RL system has learned not only that turning the handlebars right or left when tilted 45 degrees to the right is bad, but also that the "state" of being tilted 45 degrees to the right is itself bad. The RL system then begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right, and so on.

Reinforcement Learning is suitable for state-action problems, such as board games: backgammon, chess, the 8-puzzle, … (Reinforcement Learning in Board Games, Imran Ghory, 2004). [Figure: a state-action graph with states s0…s8 connected by actions a1…a7.]

What's Reinforcement Learning? Notation: s = state, a = action, r = reward. [Figure: the agent-environment interaction loop — at each step the agent observes the state s_t and reward r_t from the environment and chooses an action a_t.] The agent learns a control policy π: S → A.

Example: TD-Gammon, Tesauro (1995). RL applied to backgammon, reaching world-championship level. Immediate reward: +100 if win, -100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself; now approximately equal to the best human players.

An Example of a Reward Function
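
The figure for this slide is not part of the transcript, so here is a minimal sketch of the kind of reward function such examples typically use: a small grid world whose goal cell pays +100 and every other move pays 0. The grid layout, the goal position and the value +100 are illustrative assumptions, not the slide's own example.

    # A minimal reward-function sketch for a small grid world.
    # The goal cell and the +100 value are illustrative assumptions.

    GOAL_STATE = (2, 2)  # hypothetical absorbing goal cell

    def reward(state, action, next_state):
        """Immediate reward: +100 for entering the goal cell, 0 otherwise."""
        return 100 if next_state == GOAL_STATE else 0

    # Moving into the goal cell yields +100; any other move yields 0.
    print(reward((2, 1), "right", (2, 2)))  # 100
    print(reward((0, 0), "up", (0, 1)))     # 0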

The Goal in Reinforcement Learning. Goal: learn to choose actions that maximize r0 + γ·r1 + γ²·r2 + …, where 0 ≤ γ < 1. The discount factor γ is used to exponentially decrease the weight of reinforcements received further in the future. This quantity is called the discounted cumulative reward.

Discounted Cumulative Reward: example with γ = 0.9.
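
As a small sketch (the reward sequence below is made up for illustration), the discounted cumulative reward with γ = 0.9 can be computed directly from the definition on the previous slide:

    # Discounted cumulative reward: r0 + gamma*r1 + gamma^2*r2 + ...
    # The reward sequence is a made-up episode for illustration only.

    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    rewards = [0, 0, 0, 100]           # reward arrives only at the final step
    print(discounted_return(rewards))  # 0.9**3 * 100 = 72.9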

Other Options. Finite-horizon model: maximize the expected total reward over the next h steps. Average-reward model: maximize the long-run average reward per step. Average discounted reward model.
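
The slide's formulas are not in the transcript; the sketch below writes out the usual textbook forms of the first two criteria over a finite reward sequence, plus one plausible normalization for the third. Treat these exact definitions as assumptions rather than the slide's own equations.

    # Alternative optimality criteria, written as plain functions over a
    # finite reward sequence. The exact forms follow common textbook usage
    # and are assumptions here, not reproductions of the slide's formulas.

    def finite_horizon_return(rewards, h):
        """Total reward over the next h steps: r0 + r1 + ... + r_{h-1}."""
        return sum(rewards[:h])

    def average_reward(rewards):
        """Long-run average reward per step, approximated over the sequence."""
        return sum(rewards) / len(rewards)

    def average_discounted_return(rewards, gamma=0.9):
        """One plausible reading of the 'average discounted reward' model:
        the discounted return normalized by the total discount weight."""
        weights = [gamma ** t for t in range(len(rewards))]
        return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)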

Different Types of Learning Tasks. Agent's actions: deterministic or nondeterministic. The agent may or may not be able to predict the next state that will result from each action. Trainer of the agent: an expert (who shows it examples of optimal action sequences), or the agent itself (training itself by performing actions of its own choice).

Q-Learning for Simple Deterministic Worlds
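
The algorithm box itself did not survive in the transcript, so here is a minimal sketch of Q-learning for a deterministic world using the update applied on the next slide, Q(s, a) ← r + γ max_a' Q(s', a'). The four-state chain environment is a made-up example, not the slide's own world.

    # Q-learning for a simple deterministic world (sketch).
    # Update rule: Q(s, a) <- r + gamma * max_a' Q(s', a').
    # The toy chain environment below is illustrative only.

    import random

    gamma = 0.9
    states = [0, 1, 2, 3]            # state 3 is an absorbing goal state
    actions = ["left", "right"]

    def step(s, a):
        """Deterministic transition; +100 for reaching the goal state."""
        s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
        r = 100 if (s_next == 3 and s != 3) else 0
        return s_next, r

    Q = {(s, a): 0.0 for s in states for a in actions}

    for episode in range(200):
        s = 0
        while s != 3:
            a = random.choice(actions)   # explore uniformly at random
            s_next, r = step(s, a)
            Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in actions)
            s = s_next

    print(Q[(2, "right")])   # 100.0: one step from the goal
    print(Q[(0, "right")])   # 81.0 = 0.9**2 * 100: three steps from the goal

After convergence the one-, two- and three-step values settle at 100, 90 and 81, the same geometric pattern that appears in the worked example on the next slide.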

Example: Q(s1, a_right) ← r + γ · max_a' Q(s2, a') = 0 + 0.9 × max{63, 81, 100} = 90.

RL as a function approximation method. Learning the control policy π: S → A is very similar to a function approximation problem, except for two issues. (1) Delayed reward: in RL the trainer provides only a sequence of immediate reward values, so the agent faces the problem of temporal credit assignment. (2) Exploration vs. exploitation (next slide): explore to collect new information, or exploit what has already been learned to maximize the cumulative reward; moreover, in RL the agent influences the distribution of training examples through the action sequence it chooses.

Explore or Exploit? Q-learning itself does not specify how to choose an action among the possible actions. Some options: random uniform selection; selecting the action with the highest Q-value; or probabilistic selection, with a probability that grows with the estimated Q-value, e.g. P(a_i | s) ∝ k^Q̂(s, a_i) (see the sketch below). Small k => exploration, large k => exploitation. A common choice is a small k at the beginning of the learning process, then gradually increasing k.
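
A minimal sketch of this probabilistic selection rule, assuming the standard form P(a_i | s) ∝ k^Q̂(s, a_i); the Q-values below are made-up numbers:

    # Probabilistic action selection: P(a_i | s) proportional to k ** Q(s, a_i).
    # A small k gives near-uniform exploration; a large k strongly favors
    # high-Q actions (exploitation). The Q-values below are illustrative.

    import random

    def select_action(q_values, k):
        """q_values: dict mapping action -> estimated Q(s, a); k > 0."""
        acts = list(q_values.keys())
        weights = [k ** q_values[a] for a in acts]
        return random.choices(acts, weights=weights, k=1)[0]

    q_at_s = {"left": 63.0, "up": 81.0, "right": 100.0}
    print(select_action(q_at_s, k=1.001))  # nearly uniform: exploration
    print(select_action(q_at_s, k=1.5))    # almost always "right": exploitation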

RL vs. other function approximation (continued). Partially observable states: in many practical situations the sensors provide only partial information about the state (like a single camera on the front of a robot); a common solution is to consider previous observations together with the current sensor data. Life-long learning: unlike a one-off function approximation task, in RL a robot typically needs to learn many tasks simultaneously and to keep learning online throughout its lifetime.

RL Convergence: proved on pp. 377-378 of Mitchell. Three conditions for convergence: the environment is a deterministic Markov Decision Process (MDP); immediate rewards are positive and bounded; the agent selects every state-action pair infinitely often.

Markov Decision Process. Finite set of states S; set of actions A; t: discrete time step; s_t: the state at time t; a_t: the action at time t. At each discrete time step the agent observes the state s_t ∈ S and chooses an action a_t ∈ A; it then receives an immediate reward r_t, and the state changes to s_{t+1}. Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action. The functions δ and r may be nondeterministic, and they are not necessarily known to the agent. The interaction produces a trajectory s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, r_{t+2}, a_{t+2}, …
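
To make the notation concrete, here is a minimal sketch of a deterministic MDP as the pair of functions δ and r from this slide; the two-state example is made up purely for illustration.

    # Minimal MDP sketch: finite state set S, action set A, transition
    # function delta(s, a) and reward function r(s, a). In general delta
    # and r may be nondeterministic and unknown to the agent; the concrete
    # two-state example here is made up for illustration.

    S = {"s0", "s1"}
    A = {"stay", "go"}

    def delta(s, a):
        """Deterministic transition: the next state depends only on (s, a)."""
        return "s1" if a == "go" else s

    def r(s, a):
        """Immediate reward depends only on the current state and action."""
        return 10 if (s == "s0" and a == "go") else 0

    # Generate a short trajectory s_t, a_t, r_t, s_{t+1}, ...
    s = "s0"
    for a in ["stay", "go", "stay"]:
        print(s, a, r(s, a), "->", delta(s, a))
        s = delta(s, a)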

Other issues in RL (Mitchell, pp. 381-386): reinforcement learning for nondeterministic rewards and actions; temporal difference learning; generalizing from examples; relationship to dynamic programming; continuous reinforcement learning (state of the art).

Homework: Exercise 13.3, Tic-Tac-Toe.