Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary


Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Introduction
Supervised learning: example → class
Reinforcement learning: situation → reward, situation → reward, …

Examples
Playing chess: reward comes at the end of the game.
Ping-pong: reward on each point scored.
Animals: hunger and pain are negative rewards; food intake is a positive reward.

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Passive Learning
We assume the policy π is fixed: in state s we always execute action π(s).
Rewards are given.

Typical Trials
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → … → (4,3) +1
Goal: use the observed rewards to learn the expected utility U^π(s).

Expected Utility
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]
The expected discounted sum of rewards when policy π is followed starting from state s.

Example
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
Total reward (with γ = 1): (−0.04 × 5) + 1 = 0.80
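
As a quick sanity check, here is a minimal Python sketch (not from the slides) that computes the return of a single trial from its reward sequence; the −0.04 per non-terminal state and the +1 at the goal are taken from the example above.

```python
# Minimal sketch (not from the slides): the return of a single trial, i.e. the
# discounted sum of its rewards. With gamma = 1 this reproduces the 0.80 above.
def discounted_return(rewards, gamma=1.0):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

trial_rewards = [-0.04] * 5 + [1.0]      # five non-terminal states, then the +1 goal
print(discounted_return(trial_rewards))  # ~0.80
```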

Direct Utility Estimation
Convert the problem into a supervised learning problem, e.g. (1,1) → U = 0.72, (2,1) → U = 0.68, …
Learn to map states to utilities by averaging the returns observed after each state.
But utilities are not independent of each other!
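
The following Python sketch illustrates direct utility estimation under the assumption that each trial is recorded as a list of (state, reward) pairs; the state names and helper function are illustrative, not from the slides.

```python
# Minimal sketch: average the observed return-to-go for each state over trials.
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        ret = 0.0
        # Walk the trial backwards, accumulating the return-to-go of each state.
        for state, reward in reversed(trial):
            ret = reward + gamma * ret
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
print(direct_utility_estimation([trial]))  # e.g. (1, 1) -> 0.80
```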

Bellman Equations
Utility values obey the following equations:
U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')
They can be solved using dynamic programming, but this assumes knowledge of the model T.
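
A minimal sketch of solving these equations by iterative policy evaluation; the model representation (T as lists of (next_state, probability) pairs under the fixed policy, R as a reward table) is an assumption made for illustration.

```python
# Minimal sketch (hypothetical model format): iterative policy evaluation of
# the Bellman equations for a fixed policy pi. T[s] lists (next_state, prob)
# pairs for the action pi(s); R[s] is the immediate reward; terminals are
# absorbing states whose utility is just their reward.
def evaluate_policy(states, T, R, gamma=1.0, terminals=(), sweeps=100):
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s in terminals:
                U[s] = R[s]
            else:
                U[s] = R[s] + gamma * sum(p * U[s2] for s2, p in T[s])
    return U

# Two-state toy model: s0 always moves to the terminal state s1.
print(evaluate_policy(["s0", "s1"], {"s0": [("s1", 1.0)]},
                      {"s0": -0.04, "s1": 1.0}, terminals={"s1"}))
# -> {'s0': 0.96, 's1': 1.0}
```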

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Temporal Difference Learning
Use the following update rule after observing a transition from s to s':
U^π(s) ← U^π(s) + α [ R(s) + γ U^π(s') − U^π(s) ]
α is the learning rate. This is the temporal-difference equation; it makes no model assumption.

Example
U(1,3) = 0.84, U(2,3) = 0.92
After a transition from (1,3) to (2,3) we would expect U(1,3) ≈ −0.04 + U(2,3) = −0.04 + 0.92 = 0.88 (with γ = 1).
The current value 0.84 is a bit low, so we must increase it.
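
A minimal sketch of the TD update applied to this example; the utility table and the choices α = 0.5 and γ = 1 are assumptions made to match the numbers above.

```python
# Minimal sketch of the TD(0) update rule from the previous slide.
def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])

U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), reward=-0.04, alpha=0.5, gamma=1.0)
print(U[(1, 3)])  # 0.86: moved from 0.84 toward the target 0.88
```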

Considerations
The update moves values toward the equilibrium of the Bellman equations.
Each update involves only the observed successor, not all possible successors.
Over many trials the updates converge to the correct utilities for the policy.

Other Heuristics
Prioritized sweeping: prefer adjustments to states whose most probable successors have just undergone a large adjustment in their utility estimates.
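
A rough sketch of the idea (the T, R, and predecessors tables are hypothetical): keep a priority queue of states whose estimates are likely to need a large adjustment and update the most urgent ones first.

```python
# Rough sketch of prioritized sweeping over a learned model.
import heapq

def prioritized_sweeping(U, R, T, predecessors, start_state, gamma=1.0,
                         theta=1e-3, max_updates=1000):
    """T[s] -> list of (next_state, prob); predecessors[s] -> states leading to s."""
    pq = [(-1.0, start_state)]  # max-heap via negated priorities
    for _ in range(max_updates):
        if not pq:
            break
        _, s = heapq.heappop(pq)
        new_u = R[s] + gamma * sum(p * U[s2] for s2, p in T.get(s, []))
        change = abs(new_u - U[s])
        U[s] = new_u
        if change > theta:
            # States leading into a state that changed a lot get updated next.
            for sp in predecessors.get(s, []):
                heapq.heappush(pq, (-change, sp))
    return U
```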

Richard Sutton
Co-author of the classic textbook "Reinforcement Learning: An Introduction" (Sutton and Barto, MIT Press).
Department of Computing Science, University of Alberta.

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Active Reinforcement Learning
Now the agent must also decide which actions to take.
Greedy policy: always choose the action with the highest estimated utility.
Is that the right thing to do?

Active Reinforcement Learning
No! The agent can get stuck in a suboptimal solution.
This is the exploration vs. exploitation tradeoff.
Why is this important? Because the learned model is not the same as the true environment.

Explore vs. Exploit
Exploitation: maximize immediate reward using the current estimates.
Exploration: gather information to maximize long-term well-being.

Bandit Problem
An n-armed bandit has n levers; which lever should be pulled to maximize the expected reward?
In genetic algorithms, the selection strategy allocates coins optimally given an appropriate set of assumptions.
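
One simple, widely used way to balance exploration and exploitation on a bandit is an ε-greedy strategy, sketched below; the pull function, payout probabilities, and parameter values are assumptions for illustration, not part of the slides.

```python
# Minimal sketch (not from the slides): epsilon-greedy play of an n-armed bandit.
# pull(arm) is assumed to return a stochastic reward for the chosen lever.
import random

def epsilon_greedy_bandit(pull, n_arms, steps=1000, epsilon=0.1):
    estimates = [0.0] * n_arms   # running mean reward per lever
    counts = [0] * n_arms
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        r = pull(arm)
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]
    return estimates

# Example: three levers with hidden payout probabilities 0.2, 0.5, 0.8.
payouts = [0.2, 0.5, 0.8]
print(epsilon_greedy_bandit(lambda a: float(random.random() < payouts[a]), 3))
```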

Solution
U^+(s) ← R(s) + γ max_a f(u, N(a,s))
U^+(s): optimistic estimate of the utility of s.
u: the expected utility of taking action a in s according to the current estimates.
N(a,s): number of times action a has been tried in s.
f(u,n): exploration function, increasing in u (exploitation) and decreasing in n (exploration).
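
A minimal sketch of one simple exploration function: be optimistic about any action tried fewer than N_e times by returning a large optimistic value R⁺, and otherwise return the current estimate u. R⁺ and N_e are tunable assumptions, not values from the slides.

```python
# Minimal sketch of a simple exploration function.
def exploration_f(u, n, r_plus=2.0, n_e=5):
    # Increasing in u (exploitation); drops the optimism once n >= n_e (exploration).
    return r_plus if n < n_e else u
```

Choosing actions by the highest f value forces every action to be tried at least N_e times in each state before the agent settles on its current estimates.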

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Applications: Game Playing
Checkers-playing program by Arthur Samuel (IBM).
Update rule: change the weights of the evaluation function by the difference between the value of the current state and the backed-up value obtained from a full look-ahead tree.

Applications: Robot Control
The cart-pole balancing problem: control the horizontal position x of the cart so that the pole stays roughly upright.

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

The goal is to learn utility values and an optimal mapping from states to actions.
Direct utility estimation ignores the dependencies among states, which must obey the Bellman equations.
Temporal difference learning updates values to match those of successor states.
Active reinforcement learning adds the choice of actions and learns an optimal mapping from states to actions while trading off exploration against exploitation.

Video