Value Iteration & Q-learning. CS 5368, Song Cui.

Outline: Recap, Value Iteration, Q-learning.

Recap: the meaning of "Markov"; the MDP model; solving MDPs.

Recap: utility of reward sequences; the discount rate determines how far into the future the agent effectively "sees"; the state-value function V^π(s); the action-value function Q^π(s,a).
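For reference, the standard definitions of these two functions (the slide names them but does not write them out; the form below is the usual textbook one, stated here as an assumption consistent with the Bellman updates used later):

```latex
% State-value function of policy pi: expected discounted return starting from s
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]

% Action-value function of policy pi: same, but the first action is fixed to a
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s,\ a_0 = a\right]
```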

Value Iteration: how to find V_k^*(s) as k → ∞. Almost a solution: plain recursion (expectimax-style). The correct solution: dynamic programming, which is exactly value iteration.

Bellman update: V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]. Another way to see it is as alternating V-nodes and Q-nodes: at a Q-node, Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]; at a V-node, V_{k+1}(s) = max_a Q_{k+1}(s,a).

Value Iteration algorithm: initialize V_0(s) = 0 for all s; then for k = 1, 2, 3, ..., apply the Bellman update above to every state; stop once the values stop changing. (A runnable sketch follows the two-state example below.)

Value Iteration theorem: the values converge to the optimal values. The policy extracted from the values may converge earlier than the values themselves. Three components to return.

Value Iteration's advantage compared with Expectimax. Given a toy MDP with state space {s1, s2}, two actions, transitions that succeed with probability 80% (and end in the other state with probability 20%), and rewards of 1 for reaching s1 and 0 for reaching s2: the expectimax tree for this MDP repeats the same s1/s2 subtrees over and over, while value iteration stores one value per state and reuses it. [Slide shows the expectimax tree with the repeated s1 and s2 subtrees marked "Repeats!".]
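A minimal value-iteration sketch on one reading of this two-state MDP. The exact transition and reward structure on the slide is partly garbled, so the numbers below (each action targets one state and succeeds with probability 0.8, landing in s1 gives reward 1 and s2 gives 0, discount γ = 0.9) are assumptions for illustration, not the slide's definitive setup.

```python
# Assumed two-state MDP (see the caveats above).
states = ["s1", "s2"]
actions = ["go_s1", "go_s2"]
gamma = 0.9
reward = {"s1": 1.0, "s2": 0.0}   # reward for the state you land in

def transitions(s, a):
    """Return (next_state, probability, reward) triples for taking a in s."""
    target = "s1" if a == "go_s1" else "s2"
    other = "s2" if target == "s1" else "s1"
    return [(target, 0.8, reward[target]), (other, 0.2, reward[other])]

def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    while True:
        # Bellman update for every state: max over actions of expected (r + gamma * V)
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s_next]) for s_next, p, r in transitions(s, a))
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

print(value_iteration())   # converged optimal state values
```

The contrast with expectimax is the table V: it caches each state's value once per iteration, whereas an expectimax expansion would recompute the same s1/s2 subtrees at every level of the tree.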

Q-learning compared with Value Iteration. Same: both assume an underlying MDP and both seek a good policy. Different: in Q-learning the transition model T(s,a,s') and the reward function R(s,a,s') are unknown, so the MDP has to be solved in a different way (learning from experience without a known model, rather than planning with one). This is reinforcement learning: the agent improves its policy from experience and observed rewards. Key distinctions: model-based vs. model-free, and passive vs. active learning.

Q-learning builds on Q-value iteration. Value iteration: V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]. Q-values: Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ].
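A sketch of the Q-value-iteration step on the same assumed two-state MDP used in the value-iteration sketch above (the states, actions, rewards, and γ = 0.9 are the same illustrative assumptions, repeated here so the snippet runs on its own):

```python
# Q-value iteration: like value iteration, but the table is indexed by (state, action).
states = ["s1", "s2"]
actions = ["go_s1", "go_s2"]
gamma = 0.9
reward = {"s1": 1.0, "s2": 0.0}

def transitions(s, a):
    """(next_state, probability, reward) triples; the action succeeds with prob 0.8."""
    target = "s1" if a == "go_s1" else "s2"
    other = "s2" if target == "s1" else "s1"
    return [(target, 0.8, reward[target]), (other, 0.2, reward[other])]

def q_value_iteration(num_iters=200):
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0(s, a) = 0
    for _ in range(num_iters):
        Q = {
            (s, a): sum(
                p * (r + gamma * max(Q[(s_next, a2)] for a2 in actions))
                for s_next, p, r in transitions(s, a)
            )
            for s in states for a in actions
        }
    return Q

print(q_value_iteration())   # one Q-value per (state, action) pair
```

Q-learning (next slide) performs essentially this update, but from single sampled transitions instead of a full sum over a known transition model.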

Q-learning: a sample-based version of Q-value iteration. Process: in state s, take action a, observe the next state s' and reward r; form the sample = r + γ max_{a'} Q(s',a'); then update the Q-value toward that sample: Q(s,a) ← (1-α) Q(s,a) + α · sample.
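A minimal sketch of this update rule; the table representation (a defaultdict keyed by (state, action)) and the example transition values are illustrative assumptions, and only the update formula itself comes from the slide:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed transition (s, a, r, s')."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Tiny usage example on made-up transitions:
actions = ["go_s1", "go_s2"]
Q = defaultdict(float)                           # Q(s, a) starts at 0 for every pair
q_update(Q, "s1", "go_s1", 1.0, "s1", actions)   # observed reward 1, ended in s1
q_update(Q, "s1", "go_s2", 0.0, "s2", actions)   # observed reward 0, ended in s2
print(dict(Q))
```

Unlike the Q-value-iteration sketch, no transition model is used here; the expectation over next states is replaced by whatever single transition the agent actually experienced.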

Q-learning converges to the optimal policy, provided every state-action pair is sampled enough and the learning rate eventually becomes small enough. Ways to explore: epsilon-greedy action selection, i.e. with probability ε act randomly, otherwise act according to the best current Q-value.
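A minimal epsilon-greedy sketch matching that description; the Q-table format is the defaultdict from the previous sketch, and the parameter value ε = 0.1 is an illustrative assumption:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Usage with the Q-table built in the previous sketch:
# a = epsilon_greedy(Q, "s1", ["go_s1", "go_s2"], epsilon=0.1)
```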