CPSC 7373: Artificial Intelligence, Lecture 11: Reinforcement Learning. Jiang Bian, Fall 2012, University of Arkansas at Little Rock.


Reinforcement Learning
In the MDP lectures, we learned how to determine an optimal sequence of actions for an agent in a stochastic environment.
– An agent that knows the correct model of the environment can navigate it, finding its way to the positive rewards and avoiding the negative penalties.
– Reinforcement learning can guide the agent to an optimal policy even though it knows nothing about the rewards when it starts out.

Reinforcement Learning
(Grid world from the previous lecture: columns 1–4, rows a–c, with a START cell.)
What if we don't know where the +100 and -100 rewards are when we start?
A reinforcement learning agent can learn to explore the territory, find where the rewards are, and then learn an optimal policy.
An MDP solver can only do that once it knows exactly where the rewards are.
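
To make the later sketches concrete, here is a minimal environment sketch for a 4x3 grid world of the kind shown on this slide. The cell labels, the placement of the +100/-100 terminals, the step penalty, and the 80/10/10 action noise are all illustrative assumptions, not taken from the slide.

```python
import random

# Hypothetical 4x3 grid world: rows a-c, columns 1-4 (labels as on the slide).
ROWS, COLS = ["a", "b", "c"], [1, 2, 3, 4]
STATES = [f"{r}{c}" for r in ROWS for c in COLS]

# Assumed placement of the terminal rewards and an assumed step penalty;
# the slide does not say where +100/-100 sit, so these are placeholders.
TERMINALS = {"a4": +100, "b4": -100}
STEP_REWARD = -3
ACTIONS = ["up", "down", "left", "right"]

def move(state, action):
    """Deterministic effect of an action; bumping into the edge stays put."""
    r, c = ROWS.index(state[0]), int(state[1:])
    if action == "up":
        r = max(r - 1, 0)
    elif action == "down":
        r = min(r + 1, len(ROWS) - 1)
    elif action == "left":
        c = max(c - 1, 1)
    elif action == "right":
        c = min(c + 1, len(COLS))
    return f"{ROWS[r]}{c}"

def step(state, action):
    """Stochastic step: 80% intended move, 10% each perpendicular slip (assumed noise model)."""
    perp = {"up": ["left", "right"], "down": ["left", "right"],
            "left": ["up", "down"], "right": ["up", "down"]}[action]
    actual = random.choices([action] + perp, weights=[0.8, 0.1, 0.1])[0]
    nxt = move(state, actual)
    reward = TERMINALS.get(nxt, STEP_REWARD)
    return nxt, reward, nxt in TERMINALS
```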

RL Example
Backgammon is a stochastic game. In the 1990s, Gerald Tesauro at IBM wrote a program to play backgammon.
– Attempt #1: learn the utility of a game state from examples labeled by human expert backgammon players. Only a small number of states were labeled, and the program tried to generalize from those labels using supervised learning.
– Attempt #2: no human expertise and no supervision. One copy of the program played against another; at the end of each game the winner received a positive reward and the loser a negative one. This version learned to perform at the level of the very best players in the world from about 200,000 games.

Forms of Learning
– Supervised: (x1, y1), (x2, y2), ...; learn y = f(x)
– Unsupervised: x1, x2, ...; learn P(X = x)
– Reinforcement: s, a, s, a, ...; r; learn the optimal policy: what is the right thing to do in any of the states

Forms of Learning Examples: label each as Supervised (S), Unsupervised (U), or Reinforcement (R).
– Speech recognition: examples of voice recordings together with the transcribed text for each recording; from these we try to learn a model of language.
– Star data: for each star, a list of the different emission frequencies of light reaching Earth; by analyzing the spectral emissions we try to find clusters of stars of similar types that may be of interest to astronomers.
– Lever pressing: a rat is trained to press a lever to get a release of food when certain conditions are met.
– Elevator controller: a bank of elevators in a building needs a policy to decide which elevator goes up and which goes down in response to the percepts, i.e., the button presses at various floors; the input is a sequence of button presses, and we try to minimize the wait time.

MDP Review
Markov Decision Processes:
– List of states: s1, ..., sn
– List of actions: a1, ..., ak
– State transition matrix: T(s, a, s') = P(s' | s, a)
– Reward function: R(s') or R(s, a, s')
– Finding the optimal policy π(s): in each state, look at all possible actions and choose the one with the highest expected utility, where the expectation is taken over the transition probabilities (see the sketch below).
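
As an illustration of the last bullet, here is a minimal sketch (not from the slides) of how an MDP solver extracts a greedy policy once T, R, and the current utilities U are known. It assumes T, R, and U are plain dictionaries and uses an arbitrary discount gamma.

```python
def expected_utility(s, a, U, T, R, gamma=0.9):
    """Expected value of doing action a in state s:
    sum over s' of P(s'|s, a) * (R(s') + gamma * U(s')).
    T[(s, a)] is assumed to map each successor s' to its probability."""
    return sum(p * (R[s2] + gamma * U[s2]) for s2, p in T[(s, a)].items())

def greedy_policy(states, actions, U, T, R, gamma=0.9):
    """pi(s): in each state, pick the action with the highest expected utility."""
    return {s: max(actions, key=lambda a: expected_utility(s, a, U, T, R, gamma))
            for s in states}
```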

Agents of RL
Problem with the MDP setting: the reward function R and/or the transition model P may be unknown.

Agent type          | Knows | Learns  | Uses
Utility-based agent | P     | R -> U  | U
Q-learning agent    |       | Q(s, a) | Q
Reflex agent        |       | π(s)    | π

– Utility-based agent: knows P, learns R, and uses P and R to learn the utility function U; then solve as an MDP.
– Q-learning agent: learns a utility function Q(s, a) over state–action pairs.
– Reflex agent: learns the policy directly (stimulus–response).

Passive and Active
(Same agent-type table as on the previous slide.)
Passive RL: the agent has a fixed policy and simply executes it.
– e.g., your friend is driving from Little Rock to Dallas; you learn about the rewards along the way (say, a shortcut), but you cannot change your friend's driving behavior (the policy).
Active RL: the agent changes its policy as it goes.
– e.g., you take over control of the car and adjust the policy based on what you have learned.
– It also gives you the possibility to explore.

Passive Temporal Difference Learning
(Grid world as before.) The agent keeps the fixed policy π, utility estimates U(s), visit counts N(s), and the previous state s and reward r.
On observing a transition to s' with reward r':
– If s' is new, then U[s'] <- r'
– If s is not null, then increment Ns[s] and update
  U[s] <- U[s] + α(Ns[s]) (r + γ U[s'] - U[s])
where α(n) is a decaying learning rate, e.g., α(n) = 1/(n+1).
A code sketch of this update follows below.
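
A minimal sketch of the update above, assuming U and Ns are dictionaries and that the caller stores the previous state and reward between percepts; the names and the 1/(N+1) schedule follow the slide, everything else is illustrative.

```python
def passive_td_update(U, Ns, s, r, s_prime, r_prime, gamma=0.9):
    """One observation step of passive TD learning, following the slide's pseudocode.
    U: utility estimates, Ns: visit counts, (s, r): previous state and reward,
    (s_prime, r_prime): newly observed state and reward.
    Returns the (state, reward) pair to carry forward to the next step."""
    if s_prime not in U:                  # if s' is new then U[s'] <- r'
        U[s_prime] = r_prime
    if s is not None:                     # if s is not null
        Ns[s] = Ns.get(s, 0) + 1          # increment Ns[s]
        alpha = 1.0 / (Ns[s] + 1)         # alpha(N) = 1/(N+1), as on the slide
        U[s] += alpha * (r + gamma * U[s_prime] - U[s])
    return s_prime, r_prime
```

Each time the agent executing π observes a new percept (s', r'), it calls this function with the stored (s, r) and then carries the returned pair forward.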

Passive Agent Results

Weaknesses (quiz: true or false?)
– Long convergence time?
– Limited by the policy?
– Missing states?
– Poor estimates?
The underlying problem: the policy is fixed!

Active RL: Greedy
π <- π': after each utility update, recompute the new optimal policy.
How should the agent behave? Always choose the action with the highest expected utility?
Exploration vs. exploitation: occasionally try "suboptimal" actions!
– Randomly? (One standard scheme is sketched below.)
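
The slide only poses the question; one common answer to "occasionally try suboptimal actions" is epsilon-greedy selection. The sketch below is illustrative, with an arbitrary epsilon and a caller-supplied value estimate.

```python
import random

def epsilon_greedy(state, actions, value, epsilon=0.1):
    """With probability epsilon, explore by picking a random action;
    otherwise exploit by picking the action with the highest estimated value.
    `value(state, action)` is whatever estimate the agent maintains (U- or Q-based)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: value(state, a))
```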

Errors in Utility Tracking
The agent maintains π, U(s), and N(s).
Reasons for errors in U:
– Not enough samples (random fluctuations)
– Not a good policy
Questions: which of these make U too low? Which make it too high? Which improve with more visits N?

Exploration Agents
An exploration agent will:
– be more proactive about exploring the world when it is uncertain, and
– fall back to exploiting the (sub-)optimal policy when it becomes more certain about the world.
The TD update is the same as before:
– If s' is new, then U[s'] <- r'
– If s is not null, then increment Ns[s] and U[s] <- U[s] + α(Ns[s]) (r + γ U[s'] - U[s])
but the utility used for choosing actions is optimistic:
U+(s) = +R (a large optimistic reward) while Ns[s] < Ne; otherwise U+(s) = U(s).
(A code sketch of this exploration estimate follows below.)
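
A minimal sketch of the optimistic estimate U+ above; the particular values of R_plus and N_e are design parameters I chose for illustration, not values from the slide.

```python
def optimistic_utility(U, Ns, s, R_plus=100.0, N_e=5):
    """Exploration estimate U+(s): treat states visited fewer than N_e times
    as if they were worth the optimistic reward R_plus; otherwise use the
    learned utility U[s]."""
    if Ns.get(s, 0) < N_e:
        return R_plus
    return U.get(s, 0.0)
```

The agent then chooses actions greedily with respect to U+ instead of U, which pushes it toward under-visited states early on and toward exploitation later.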

Exploratory Agent Results

Q-Learning
U -> π: the policy at each state is the action with the highest expected value, which requires the transition model P.
What if P is unknown? Use Q-learning: learn a value Q(s, a) for each state–action pair directly, so no transition model is needed.

Q-learning
Update rule: Q(s, a) <- Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') - Q(s, a))
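
A minimal sketch of the tabular update above, assuming Q and the visit counts are dictionaries keyed by (state, action); the 1/(N+1) learning-rate schedule mirrors the earlier TD agent and is an assumption, as are the names.

```python
def q_learning_update(Q, Ns, s, a, r, s_prime, actions, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a)).
    Q and Ns are dictionaries keyed by (state, action)."""
    Ns[(s, a)] = Ns.get((s, a), 0) + 1
    alpha = 1.0 / (Ns[(s, a)] + 1)          # same 1/(N+1) schedule as the TD agent
    best_next = max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```

Actions can then be selected greedily (or epsilon-greedily, as sketched earlier) over Q(s, ·), with no transition model required.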

Conclusion
– If we know P, we can learn R and derive the utilities U, which reduces the problem to an MDP.
– If we know neither P nor R, we can use Q-learning, which uses Q(s, a) as the utility function.
– We also learned about the trade-off between exploration and exploitation.