Reinforcement Learning

Presentation transcript:

Reinforcement Learning Chapter 21 Vassilis Athitsos

Reinforcement Learning. In previous chapters: learning from examples. Reinforcement learning: learning what to do, based on rewards. Examples: learning to fly a helicopter, learning to play a game, learning to walk.

Relation to MDPs. Feedback can be provided at the end of the sequence of actions, or more frequently (compare chess, where the reward comes only at the end of the game, with ping-pong, where every point scored gives a reward). There is no complete model of the environment: the transitions may be unknown, and the reward function may be unknown.

Agents. Utility-based agent: learns a utility function on states. Q-learning agent: learns a utility function on (action, state) pairs. Reflex agent: learns a function mapping states to actions.

Passive Reinforcement Learning. Assume a fully observable environment. Passive learning: the policy is fixed (behavior does not change), and the agent learns how good each state is. Similar to policy evaluation, but the transition function and reward function are unknown. Why is it useful? For future policy revisions.

Direct Utility Estimation. For each state the agent ever visits, and for each time the agent visits that state, keep track of the accumulated rewards from that visit onwards. Similar to inductive learning: learning a function on states from samples. Weaknesses: ignores correlations between the utilities of neighboring states, and converges very slowly.
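
A minimal Python sketch of direct utility estimation under the fixed policy (not from the slides; run_episode is an assumed helper that follows the policy and returns a list of (state, reward) pairs):

    # Direct utility estimation: average the observed reward-to-go over every visit to a state.
    from collections import defaultdict

    def direct_utility_estimation(run_episode, policy, n_episodes=1000, gamma=1.0):
        totals = defaultdict(float)   # sum of observed returns per state
        counts = defaultdict(int)     # number of visits per state

        for _ in range(n_episodes):
            trajectory = run_episode(policy)      # assumed: [(state, reward), ...]
            # Compute the reward-to-go for each visit by scanning the episode backwards.
            g = 0.0
            returns = []
            for state, reward in reversed(trajectory):
                g = reward + gamma * g
                returns.append((state, g))
            for state, g in returns:
                totals[state] += g
                counts[state] += 1

        # Estimated utility of each visited state = average observed return.
        return {s: totals[s] / counts[s] for s in totals}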

Adaptive Dynamic Programming. Learns the transitions and state utilities, plugs the values into the Bellman equations, and solves the equations with linear algebra or with policy iteration. Problem: intractable for a large number of states. Example: backgammon, with roughly 10^50 equations in 10^50 unknowns.
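
A hedged sketch of the ADP idea, assuming experience is available as (s, a, r, s') tuples gathered under the fixed policy; it solves the learned Bellman equations by repeated backups rather than by exact linear algebra:

    # ADP sketch: learn a transition model and rewards from experience,
    # then evaluate the fixed policy with the Bellman equations.
    from collections import defaultdict

    def adp_policy_evaluation(transitions, gamma=0.9, sweeps=200):
        counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s_next] = N
        rewards = {}                                    # assumed: r is the reward observed in state s

        for s, a, r, s_next in transitions:
            counts[(s, a)][s_next] += 1
            rewards[s] = r

        # Estimated transition model: P(s' | s, a) = N(s, a, s') / N(s, a).
        model = {}
        for (s, a), nexts in counts.items():
            total = sum(nexts.values())
            model[(s, a)] = {s2: n / total for s2, n in nexts.items()}

        # Policy evaluation by repeated Bellman backups (the policy is fixed,
        # so each state has one observed action).
        U = defaultdict(float)
        for _ in range(sweeps):
            for (s, a), probs in model.items():
                U[s] = rewards.get(s, 0.0) + gamma * sum(p * U[s2] for s2, p in probs.items())
        return dict(U)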

Temporal Difference. Every time we make a transition from state s to state s': update the utility of s' if it is newly visited, U[s'] = the currently observed reward; and update the utility of s, U[s] ← (1 − α) · U[s] + α · (r + γ · U[s']), where α is the learning rate, r is the reward observed in s, and γ is the discount factor.
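
The update rule written out as a small Python helper (the dictionary U of utility estimates is an assumed data structure):

    # TD update from the slide: U[s] <- (1 - alpha) * U[s] + alpha * (r + gamma * U[s'])
    def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
        U.setdefault(s_next, 0.0)          # unseen successor starts at 0 in this sketch
        old = U.get(s, 0.0)
        U[s] = (1 - alpha) * old + alpha * (r + gamma * U[s_next])
        return U[s]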

Properties of Temporal Difference. What happens when an unlikely transition occurs? U[s] becomes a bad approximation of the true utility. However, U[s] is rarely a bad approximation, because unlikely transitions are rare: the average value of U[s] converges to the correct value, and if the learning rate α decreases over time, U[s] itself converges to the correct value.

Hybrid Methods. ADP: more accurate, but slower and intractable for large numbers of states. TD: less accurate, but faster and tractable. An intermediate approach uses pseudo-experiences: imagine transitions that have not actually happened, and update utilities according to those transitions.

Hybrid Methods. Making ADP more efficient: do a limited number of adjustments after each transition, and use the estimated transition probabilities to identify the most useful adjustments (sketched below).
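
One possible reading of this idea, sketched under the same assumed model/rewards structures as the ADP sketch above (a simplified, prioritized-sweeping-style heuristic, not necessarily the slides' exact method):

    # After each real transition, perform at most k extra Bellman backups,
    # choosing the states whose utility estimates would change the most.
    def limited_backups(U, model, rewards, gamma=0.9, k=5):
        candidates = []
        for (s, a), probs in model.items():
            new_u = rewards.get(s, 0.0) + gamma * sum(p * U.get(s2, 0.0) for s2, p in probs.items())
            # "Usefulness" of the adjustment = how much the estimate would change.
            candidates.append((abs(new_u - U.get(s, 0.0)), s, new_u))
        # Apply only the k most useful adjustments.
        for _, s, new_u in sorted(candidates, reverse=True)[:k]:
            U[s] = new_u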

Active Reinforcement Learning. Using passive reinforcement learning, the utilities of states and the transition probabilities are learned, and can be plugged into the Bellman equations. Problem? The Bellman equations give optimal solutions only when given the correct utility and transition functions, whereas passive reinforcement learning produces approximate estimates of those functions. Solutions?

Exploration/Exploitation. The goal is to maximize utility, but the utility function is only approximately known. Dilemma: should the agent maximize utility based on its current knowledge, or try to improve that knowledge? Answer: a little of both.

Exploration Function. U[s] = R[s] + γ · max_a f(Q(a, s), N(a, s)), where R[s] is the current reward, γ is the discount factor, Q(a, s) is the estimated utility of performing action a in state s, N(a, s) is the number of times action a has been performed in state s, and f(u, n) trades off the estimated utility against the degree of exploration so far for (a, s). Initialization: U[s] = an optimistically large value.
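
A sketch of how the exploration function might look, using the common optimistic form f(u, n) = R+ if n < Ne, else u; the constants r_plus and n_e, and the Q and N dictionaries keyed by (action, state), are assumptions, not given on the slide:

    # Optimistic exploration function: pretend rarely-tried actions are very good.
    def exploration_f(u, n, r_plus=2.0, n_e=5):
        return r_plus if n < n_e else u

    def exploratory_utility(s, R, Q, N, actions, gamma=0.9):
        # U[s] = R[s] + gamma * max_a f(Q(a, s), N(a, s))
        return R[s] + gamma * max(exploration_f(Q[(a, s)], N[(a, s)]) for a in actions)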

Q-learning. Learns the utility of (action, state) pairs, with U[s] = max_a Q(a, s). Learning can be done using TD: Q(a, s) ← (1 − β) · Q(a, s) + β · (R(s) + γ · max_a' Q(a', s')), where β is the learning rate, γ is the discount factor, s' is the next state, and a' ranges over the actions possible in s'.
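
The Q-learning update as a small Python helper, assuming Q is a dictionary keyed by (action, state) and `actions` lists the actions available in the next state:

    # Q-learning update: Q(a, s) <- (1 - beta) * Q(a, s) + beta * (r + gamma * max_a' Q(a', s'))
    def q_learning_update(Q, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
        best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
        Q[(a, s)] = (1 - beta) * Q.get((a, s), 0.0) + beta * (r + gamma * best_next)
        return Q[(a, s)]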

Generalization in Reinforcement Learning. How do we apply reinforcement learning to problems with huge numbers of states (chess, backgammon, ...)? The solution is similar to estimating the probabilities of a huge number of events: learn parametric utility functions defined over features of each state. Example: chess, where about 20 features are adequate for describing the current board.
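
A sketch of a linear parametric utility function with a TD-style gradient update, assuming a hypothetical features(s) helper that maps a state to a fixed-length list of feature values:

    # Parametric utility: U_theta(s) = sum_i theta_i * feature_i(s)
    def linear_utility(theta, features, s):
        return sum(w * f for w, f in zip(theta, features(s)))

    def td_gradient_update(theta, features, s, r, s_next, alpha=0.01, gamma=0.9):
        # TD error based on the current parametric estimates.
        delta = r + gamma * linear_utility(theta, features, s_next) - linear_utility(theta, features, s)
        # Move each weight in the direction that reduces the error
        # (the gradient of U_theta(s) with respect to theta is features(s)).
        return [w + alpha * delta * f for w, f in zip(theta, features(s))]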

Learning Parametric Utility Functions for Backgammon. First approach: design weighted linear functions of 16 terms, collect a training set of board states, and ask human experts to evaluate the training states. Result: the program was not competitive with human experts, and collecting the training data was very tedious.

Learning Parametric Utility Functions for Backgammon. Second approach: design weighted linear functions of 16 terms, let the system play against itself, and provide the reward at the end of each game. Result (after 300,000 games, taking a few weeks): the program was competitive with the best players in the world.