Value Iteration & Q-learning. CS 5368, Song Cui.

Outline: Recap, Value Iteration, Q-learning.

Recap: the meaning of "Markov"; the MDP model; solving MDPs.

Recap: utility of reward sequences; the discount rate determines how far into the future the agent effectively "sees"; the state-value function V^π(s); the action-value function Q^π(s,a).
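For reference, the standard definitions of these two functions (the slide names them but does not write them out; the form below is the usual textbook one, stated here as an assumption consistent with the Bellman updates used later):

```latex
% State-value function of policy pi: expected discounted return starting from s
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]

% Action-value function of policy pi: same, but the first action is fixed to a
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s,\ a_0 = a\right]
```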

Value Iteration: how to find V_k^*(s) as k → ∞. Almost a solution: plain recursion (expectimax-style). The correct solution: dynamic programming, which is exactly value iteration.

Bellman update: V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]. Another way to see it is as alternating V-nodes and Q-nodes: at a Q-node, Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]; at a V-node, V_{k+1}(s) = max_a Q_{k+1}(s,a).

Value Iteration algorithm: initialize V_0(s) = 0 for all s; then for k = 1, 2, 3, ..., apply the Bellman update above to every state; stop once the values stop changing. (A runnable sketch follows the two-state example below.)

Value Iteration theorem: the values converge to the optimal values. The policy extracted from the values may converge earlier than the values themselves. Three components to return.

Value Iteration's advantage compared with Expectimax. Given a toy MDP with state space {s1, s2}, two actions, transitions that succeed with probability 80% (and end in the other state with probability 20%), and rewards of 1 for reaching s1 and 0 for reaching s2: the expectimax tree for this MDP repeats the same s1/s2 subtrees over and over, while value iteration stores one value per state and reuses it. [Slide shows the expectimax tree with the repeated s1 and s2 subtrees marked "Repeats!".]
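A minimal value-iteration sketch on one reading of this two-state MDP. The exact transition and reward structure on the slide is partly garbled, so the numbers below (each action targets one state and succeeds with probability 0.8, landing in s1 gives reward 1 and s2 gives 0, discount γ = 0.9) are assumptions for illustration, not the slide's definitive setup.

```python
# Assumed two-state MDP (see the caveats above).
states = ["s1", "s2"]
actions = ["go_s1", "go_s2"]
gamma = 0.9
reward = {"s1": 1.0, "s2": 0.0}   # reward for the state you land in

def transitions(s, a):
    """Return (next_state, probability, reward) triples for taking a in s."""
    target = "s1" if a == "go_s1" else "s2"
    other = "s2" if target == "s1" else "s1"
    return [(target, 0.8, reward[target]), (other, 0.2, reward[other])]

def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    while True:
        # Bellman update for every state: max over actions of expected (r + gamma * V)
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s_next]) for s_next, p, r in transitions(s, a))
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

print(value_iteration())   # converged optimal state values
```

The contrast with expectimax is the table V: it caches each state's value once per iteration, whereas an expectimax expansion would recompute the same s1/s2 subtrees at every level of the tree.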

Q-learning compared with Value Iteration. Same: both assume an underlying MDP and both seek a good policy. Different: in Q-learning the transition model T(s,a,s') and the reward function R(s,a,s') are unknown, so the MDP has to be solved in a different way (learning from experience without a known model, rather than planning with one). This is reinforcement learning: the agent improves its policy from experience and observed rewards. Key distinctions: model-based vs. model-free, and passive vs. active learning.

Q-learning builds on Q-value iteration. Value iteration: V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]. Q-values: Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ].
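A sketch of the Q-value-iteration step on the same assumed two-state MDP used in the value-iteration sketch above (the states, actions, rewards, and γ = 0.9 are the same illustrative assumptions, repeated here so the snippet runs on its own):

```python
# Q-value iteration: like value iteration, but the table is indexed by (state, action).
states = ["s1", "s2"]
actions = ["go_s1", "go_s2"]
gamma = 0.9
reward = {"s1": 1.0, "s2": 0.0}

def transitions(s, a):
    """(next_state, probability, reward) triples; the action succeeds with prob 0.8."""
    target = "s1" if a == "go_s1" else "s2"
    other = "s2" if target == "s1" else "s1"
    return [(target, 0.8, reward[target]), (other, 0.2, reward[other])]

def q_value_iteration(num_iters=200):
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0(s, a) = 0
    for _ in range(num_iters):
        Q = {
            (s, a): sum(
                p * (r + gamma * max(Q[(s_next, a2)] for a2 in actions))
                for s_next, p, r in transitions(s, a)
            )
            for s in states for a in actions
        }
    return Q

print(q_value_iteration())   # one Q-value per (state, action) pair
```

Q-learning (next slide) performs essentially this update, but from single sampled transitions instead of a full sum over a known transition model.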

Q-learning: a sample-based version of Q-value iteration. Process: in state s, take action a, observe the next state s' and reward r; form the sample = r + γ max_{a'} Q(s',a'); then update the Q-value toward that sample: Q(s,a) ← (1-α) Q(s,a) + α · sample.
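A minimal sketch of this update rule; the table representation (a defaultdict keyed by (state, action)) and the example transition values are illustrative assumptions, and only the update formula itself comes from the slide:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed transition (s, a, r, s')."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Tiny usage example on made-up transitions:
actions = ["go_s1", "go_s2"]
Q = defaultdict(float)                           # Q(s, a) starts at 0 for every pair
q_update(Q, "s1", "go_s1", 1.0, "s1", actions)   # observed reward 1, ended in s1
q_update(Q, "s1", "go_s2", 0.0, "s2", actions)   # observed reward 0, ended in s2
print(dict(Q))
```

Unlike the Q-value-iteration sketch, no transition model is used here; the expectation over next states is replaced by whatever single transition the agent actually experienced.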

Q-learning converges to the optimal policy, provided every state-action pair is sampled enough and the learning rate eventually becomes small enough. Ways to explore: epsilon-greedy action selection, i.e. with probability ε act randomly, otherwise act according to the best current Q-value.
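A minimal epsilon-greedy sketch matching that description; the Q-table format is the defaultdict from the previous sketch, and the parameter value ε = 0.1 is an illustrative assumption:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Usage with the Q-table built in the previous sketch:
# a = epsilon_greedy(Q, "s1", ["go_s1", "go_s2"], epsilon=0.1)
```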