Reinforcement Learning Yishay Mansour Tel-Aviv University

2 Outline: Goal of Reinforcement Learning; Mathematical Model (MDP); Planning.

3 Goal of Reinforcement Learning: goal-oriented learning through interaction; control of large-scale stochastic environments with partial knowledge. By contrast, supervised / unsupervised learning learns from labeled / unlabeled examples.

4 Reinforcement Learning – origins: Artificial Intelligence, Control Theory, Operations Research, Cognitive Science & Psychology. Solid foundations; well-established research.

5 Typical Applications. Robotics – elevator control [CB], robo-soccer [SV]. Board games – backgammon [T], checkers [S], chess [B]. Scheduling – dynamic channel allocation [SB], inventory problems.

6 Contrast with Supervised Learning The system has a “state”. The algorithm influences the state distribution. Inherent Tradeoff: Exploration versus Exploitation.

7 Mathematical Model - Motivation Model of uncertainty: Environment, actions, our knowledge. Focus on decision making. Maximize long term reward. Markov Decision Process (MDP)

8 Mathematical Model – MDP (Markov Decision Process): S – set of states; A – set of actions; δ – transition probability; R – reward function. Similar to a DFA!
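
To make the tuple above concrete, here is a minimal sketch (mine, not from the lecture) of a finite MDP as a Python container; the field names and the dictionary encoding of δ and R are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int

@dataclass
class MDP:
    """Finite MDP: states S, actions A, transition probabilities (delta), rewards R."""
    states: List[State]
    actions: List[Action]
    # P[(s, a)] maps a next state s2 to its probability delta(s2 | s, a)
    P: Dict[Tuple[State, Action], Dict[State, float]]
    # R[(s, a)] is the expected immediate reward for doing action a in state s
    R: Dict[Tuple[State, Action], float]
    gamma: float  # discount factor for the discounted return, 0 <= gamma < 1
```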

9 MDP model – states and actions. The environment is a set of states; an action a induces a transition between states (slide figure: states as nodes, actions as transitions).

10 MDP model – rewards. R(s,a) = the reward at state s for doing action a (a random variable). Example: R(s,a) takes each of a few possible values with a fixed probability.

11 MDP model – trajectories. A trajectory is a sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, …
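
A short sketch (not from the slides) of sampling such a trajectory, assuming the dictionary encoding of the transition and reward functions used in the MDP sketch above:

```python
import random

def sample_trajectory(s0, policy, P, R, T):
    """Roll out s_0, a_0, r_0, s_1, a_1, r_1, ... for T steps.

    policy[s]   -- the action taken at state s
    P[(s, a)]   -- dict mapping next states to their probabilities
    R[(s, a)]   -- the immediate reward for doing a at s
    """
    s, trajectory = s0, []
    for _ in range(T):
        a = policy[s]
        trajectory.append((s, a, R[(s, a)]))
        next_states, probs = zip(*P[(s, a)].items())
        s = random.choices(next_states, weights=probs)[0]
    return trajectory
```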

12 MDP - Return function. Combining all the immediate rewards to a single value. Modeling Issues: Are early rewards more valuable than later rewards? Is the system “terminating” or continuous? Usually the return is linear in the immediate rewards.

13 MDP model – return functions. Finite horizon – parameter H: return = r_0 + r_1 + … + r_{H-1}. Infinite horizon discounted – parameter γ < 1: return = Σ_t γ^t r_t. Infinite horizon undiscounted. Terminating MDP.
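
The two most common return definitions on this slide can be written out directly; a small sketch (function names are mine):

```python
def finite_horizon_return(rewards, H):
    """Finite-horizon return: r_0 + r_1 + ... + r_{H-1}."""
    return sum(rewards[:H])

def discounted_return(rewards, gamma):
    """Discounted return: sum_t gamma^t * r_t, truncated to the rewards observed."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: constant reward 1 with gamma = 1/2 approaches 1 / (1 - gamma) = 2.
print(discounted_return([1.0] * 50, 0.5))     # ~2.0
print(finite_horizon_return([1.0] * 50, 10))  # 10.0
```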

14 MDP model - action selection Policy - mapping from states to actions Fully Observable - can “see” the “entire” state. AIM: Maximize the expected return. Optimal policy: optimal from any start state. THEOREM: There exists a deterministic optimal policy

15 Contrast with Supervised Learning Supervised Learning: Fixed distribution on examples. Reinforcement Learning: The state distribution is policy dependent!!! A small local change in the policy can make a huge global change in the return.

16 MDP model – summary. S – set of states, |S| = n. A – set of actions, |A| = k. δ(s,a) – transition function. R(s,a) – immediate reward function. π – policy. Return – discounted cumulative return.

17 Simple example: N-armed bandit. A single state s with actions a_1, a_2, a_3, … Goal: maximize the sum of immediate rewards. Difficulty: the model is unknown. Given the model: play the greedy action.

18 N-Armed Bandit: Highlights. Algorithms (near greedy): – Exponential weights: G_i is the cumulative reward of action a_i, and its weight is w_i = e^{G_i}. – Follow the (perturbed) leader. Results: for any sequence of T rewards, E[online] > max_i {G_i} − sqrt{T log N}.
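
A hedged sketch of the exponential-weights idea mentioned above, in the full-information setting where the reward of every action is revealed each round; the learning rate eta is my addition (the slide simply writes w_i = e^{G_i}).

```python
import math
import random

def exponential_weights(reward_fn, n_actions, T, eta=0.1):
    """Maintain G_i, the cumulative reward of action a_i, and play a_i with
    probability proportional to w_i = exp(eta * G_i)."""
    G = [0.0] * n_actions
    total = 0.0
    for t in range(T):
        m = max(G)                                  # subtract max for numerical stability
        w = [math.exp(eta * (g - m)) for g in G]
        Z = sum(w)
        probs = [wi / Z for wi in w]
        chosen = random.choices(range(n_actions), weights=probs)[0]
        total += reward_fn(t, chosen)
        for i in range(n_actions):                  # full-information update of every G_i
            G[i] += reward_fn(t, i)
    return total, G
```

With rewards in [0, 1] and eta of order sqrt(log N / T), the expected total reward of such a strategy is within roughly sqrt(T log N) of max_i G_i, which is the form of the bound stated on the slide.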

19 Planning – Basic Problems (given a complete MDP model). Policy evaluation – given a policy π, estimate its return. Optimal control – find an optimal policy π* (maximizes the return from any start state).

20 Planning – Value Functions. V^π(s): the expected return starting at state s and following π. Q^π(s,a): the expected return starting at state s, doing action a, and then following π. V*(s) and Q*(s,a) are defined using an optimal policy π*: V*(s) = max_π V^π(s).

21 Planning – Policy Evaluation. Discounted infinite horizon (Bellman equation): V^π(s) = E_{s'~δ(s,π(s))} [ R(s, π(s)) + γ V^π(s') ]. Rewriting the expectation gives a linear system of equations in the values V^π(s).
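
One way to solve this equation is to iterate the Bellman backup until the values stop changing; below is a minimal sketch for a deterministic policy, assuming the dictionary-based encoding of δ and R used in the earlier sketches (a stochastic policy, such as the random policy of the next example, can instead be handled by solving the linear system directly, as shown after the example).

```python
def evaluate_policy(states, policy, P, R, gamma, tol=1e-10):
    """Iterative policy evaluation for a deterministic policy: repeatedly apply
    V(s) <- R(s, pi(s)) + gamma * sum_{s2} P(s2 | s, pi(s)) * V(s2)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```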

22 Algorithms – Policy Evaluation Example. States s_0, s_1, s_2, s_3 on a cycle; A = {+1, −1}; γ = 1/2; δ(s_i, a) = s_{i+a (mod 4)}; π is the random policy; for every action a, R(s_i, a) = i. Then V^π(s_0) = 0 + γ [ π(s_0, +1) V^π(s_1) + π(s_0, −1) V^π(s_3) ].

23 Algorithms – Policy Evaluation Example (continued). With the random policy, V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3)) / 4, and similarly for the other states. Solving the linear system gives V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3.
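
Because policy evaluation is a linear system, the values above can be checked directly; the sketch below sets up the four-state cycle with the random policy (γ = 1/2, R(s_i, a) = i) and solves (I − γP)V = R with numpy.

```python
import numpy as np

n, gamma = 4, 0.5
# Random policy over {+1, -1}: from s_i we move to s_{i+1} or s_{i-1} (mod 4),
# each with probability 1/2.
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = 0.5
    P[i, (i - 1) % n] = 0.5
R = np.arange(n, dtype=float)          # R(s_i, a) = i, independent of the action

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(n) - gamma * P, R)
print(V)   # [5/3, 7/3, 11/3, 13/3] ~ [1.667, 2.333, 3.667, 4.333]
```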

24 Algorithms – Optimal Control. State-action value function: Q^π(s,a) = E[ R(s,a) ] + γ E_{s'~δ(s,a)} [ V^π(s') ]. Note: for a deterministic policy π, V^π(s) = Q^π(s, π(s)).

25 Algorithms – Optimal Control Example. Same MDP as before (A = {+1, −1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i): Q^π(s_0, +1) = 0 + γ V^π(s_1) = 7/6 and Q^π(s_0, −1) = 0 + γ V^π(s_3) = 13/6.
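
Continuing the same example, the state-action values follow from Q^π(s,a) = R(s,a) + γ V^π(s') for the deterministic transition s' = s_{i+a}; a short check using the values of the previous slide:

```python
gamma = 0.5
V = {0: 5/3, 1: 7/3, 2: 11/3, 3: 13/3}   # V^pi from the policy-evaluation example

def Q(s, a):
    """Q^pi(s, a) = R(s, a) + gamma * V^pi((s + a) mod 4), with R(s_i, a) = i."""
    return s + gamma * V[(s + a) % 4]

print(Q(0, +1))   # 0 + 0.5 * 7/3  = 7/6
print(Q(0, -1))   # 0 + 0.5 * 13/3 = 13/6
```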

26 Algorithms – Optimal Control. CLAIM: A policy π is optimal if and only if at each state s: V^π(s) = max_a {Q^π(s,a)} (Bellman equation). PROOF: Assume there is a state s and an action a such that V^π(s) < Q^π(s,a). Then the strategy that performs a at state s (the first time) is better than π. This holds each time we visit s, so the policy that always performs action a at state s is better than π.

27 Algorithms – Optimal Control Example. Same MDP (A = {+1, −1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i): the policy is improved by changing it at each state to the action with the larger state-action value Q^π(s,a).

28 Algorithms – Optimal Control. The greedy policy with respect to Q^π(s,a) is π(s) = argmax_a {Q^π(s,a)}. The ε-greedy policy with respect to Q^π(s,a) plays argmax_a {Q^π(s,a)} with probability 1 − ε and a random action with probability ε.
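
A minimal sketch of the two selection rules just defined, assuming Q is stored as a dictionary keyed by (state, action):

```python
import random

def greedy(Q, s, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """Play the greedy action with probability 1 - eps, a random action with probability eps."""
    if random.random() < eps:
        return random.choice(actions)
    return greedy(Q, s, actions)
```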

29 MDP - computing optimal policy 1. Linear Programming 2. Value Iteration method. 3. Policy Iteration method.
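
Of the three methods listed, value iteration is the easiest to sketch; the dictionary encoding of δ and R below is the same assumption as in the earlier sketches, not the lecture's notation.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Repeat V(s) <- max_a [ R(s, a) + gamma * sum_{s2} P(s2 | s, a) * V(s2) ] until
    convergence, then return the values and a greedy policy with respect to them."""
    def backup(s, a, V):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(backup(s, a, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
    return V, policy
```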

30 Convergence. Value Iteration – the distance from the optimum, max_s {V*(s) − V_t(s)}, drops each iteration. Policy Iteration – the policy can only improve: ∀s, V_{t+1}(s) ≥ V_t(s); fewer iterations than Value Iteration, but each iteration is more expensive.

31 Relations to Board Games. State = current board; action = the move we play; opponent's move = part of the environment; value function = probability of winning; Q-function = modified policy. Hidden assumption: the game is Markovian.

32 Planning versus Learning Tightly coupled in Reinforcement Learning Goal: maximize return while learning.

33 Example – Elevator Control. Planning (alone): given the arrival model, build a schedule. Learning (alone): learn the arrival model well. Real objective: construct a schedule while updating the model.

34 Partially Observable MDP. Rather than observing the state, we observe some function of the state: Ob – the observation function, a random variable for each state. Problem: different states may “look” similar. The optimal strategy is history-dependent! Examples: (1) Ob(s) = s + noise; (2) Ob(s) = the first bit of s.

35 POMDP – Belief State Algorithm. Given a history of actions and observations, we compute a posterior distribution over the state we are in (the belief state). The belief-state MDP: states – distributions over S (the states of the POMDP); actions – as in the POMDP; transitions – the posterior distribution (given the observation). We can perform planning and learning on the belief-state MDP.
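
A hedged sketch of the belief update behind this construction: after doing action a and observing o, Bayes' rule gives b'(s') ∝ Ob(o | s') · Σ_s δ(s' | s, a) b(s). The argument names and dictionary encoding are illustrative assumptions.

```python
def belief_update(b, a, o, P, Ob, states):
    """Posterior belief over states after action a and observation o.

    b[s]           -- prior probability of being in state s
    P[(s, a)][s2]  -- transition probability delta(s2 | s, a)
    Ob[(s2, o)]    -- probability of observing o in state s2
    """
    unnormalized = {}
    for s2 in states:
        predicted = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in states)
        unnormalized[s2] = Ob[(s2, o)] * predicted
    Z = sum(unnormalized.values())      # > 0 whenever observation o is possible
    return {s2: p / Z for s2, p in unnormalized.items()}
```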

36 POMDP – Hard Computational Problems. Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L]. Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L]. Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].

37 Resources. Reinforcement Learning: An Introduction [Sutton & Barto]. Markov Decision Processes [Puterman]. Dynamic Programming and Optimal Control [Bertsekas]. Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]. Ph.D. thesis – Michael Littman.