A Crash Course in Reinforcement Learning

A Crash Course in Reinforcement Learning. Oliver Schulte, Simon Fraser University. Emphasis on connections with neural net learning.

Outline: What is Reinforcement Learning? Key Definitions. Key Learning Tasks. Reinforcement Learning Techniques. Reinforcement Learning with Neural Nets.

Overview

Learning To Act. So far: learning to predict. Now: learning to act. In engineering this is control theory; in economics and operations research, decision and game theory. Examples: flying a helicopter, driving a car, playing Go, playing soccer.

RL at a glance http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf

Acting in Action. Autonomous helicopter: an example of imitation learning, which starts by observing human actions (https://www.youtube.com/watch?v=VCdxqn0fcnE). Learning to play video games: "Deep Q works best when it lives in the moment" (https://www.wired.com/2015/02/google-ai-plays-atari-like-pros/). Learning to flip pancakes.

MARKOV DECISION PROCESSES

Markov Decision Processes. Recall the Markov process (MP): the state is a vector x ≅ s of input variable values, which can contain hidden variables (partially observable, POMDP), and transitions follow a probability P(s'|s). A Markov reward process (MRP) is an MP plus rewards r. A Markov decision process (MDP) is an MRP plus actions a. A Markov game is an MDP with actions and rewards for more than one agent.

Model Parameters. Transition probabilities: for a Markov process, P(s(t+1)|s(t)); for an MDP, P(s(t+1)|s(t),a(t)). Expected reward: E(r(t+1)|s(t),a(t)). Recall the basketball and hockey examples, the grid example, and David Poole's demo.
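A minimal sketch, in Python, of how these model parameters can be written down explicitly. The two-state MDP below (its states, actions, probabilities, and rewards) is invented for illustration; it is not the grid or sports example from the slides.

import random

# Transition probabilities P(s'|s,a), keyed by (state, action) -> {next_state: probability}.
# All states, actions, and numbers are made up for illustration.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# Expected rewards E(r|s,a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):   1.0,
    ("s1", "stay"): 2.0,
    ("s1", "go"):   0.0,
}

def step(state, action):
    """Sample a next state from P(.|state, action) and return it with the expected reward."""
    nxt = P[(state, action)]
    next_state = random.choices(list(nxt), weights=list(nxt.values()))[0]
    return next_state, R[(state, action)]

print(step("s0", "go"))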

Derived Concepts

Returns and Discounting. A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),... The return is the total sum of rewards. But if the trajectory is infinite, we have an infinite sum! Solution: weight by a discount factor γ between 0 and 1. Return = r(0) + γr(1) + γ²r(2) + ... The discount can also be interpreted in terms of the probability that the process ends at each step.
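A short sketch of computing this discounted return for a finite trajectory prefix; the reward values and discount factor below are arbitrary.

def discounted_return(rewards, gamma=0.9):
    """Return r(0) + γ r(1) + γ² r(2) + ... for a finite list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example with arbitrary rewards: 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0]))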

RL Concepts. These three functions (the policy, the value function, and the action-value function) can be computed by neural networks. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf

Policies and Values. A deterministic policy π is a function that maps states to actions, i.e. it tells us how to act. Policies can also be probabilistic, and can be implemented using neural nets. Given a policy and an MDP, the expected return from following the policy starting at a state s is written Vπ(s).

Optimal Policies. A policy π* is optimal if, for every other policy π and for all states s, Vπ*(s) ≥ Vπ(s). The value of the optimal policy is written V*(s).

The Action-Value Function. Given a policy π, the expected return from taking action a in state s and then following π is denoted Qπ(s,a). Similarly, Q*(s,a) is the value of an action under the optimal policy. Grid example; show Mitchell example.
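The action-value function makes acting straightforward: the greedy policy picks the action with the highest Q-value, and the state value is the best action value, V*(s) = max_a Q*(s,a). A small sketch with invented Q-values:

# Hypothetical Q-values for a two-state, two-action problem (numbers invented for illustration).
Q = {
    ("s0", "stay"): 0.5, ("s0", "go"): 1.2,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.7,
}
actions = ["stay", "go"]

def greedy_action(s):
    """π(s) = argmax over a of Q(s,a): act greedily with respect to Q."""
    return max(actions, key=lambda a: Q[(s, a)])

def state_value(s):
    """V(s) = max over a of Q(s,a)."""
    return max(Q[(s, a)] for a in actions)

print(greedy_action("s0"), state_value("s0"))  # go 1.2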

LEARNING

Two Learning Problems Prediction: For a fixed policy, learn Vπ(s). Control: For a given MDP, learn V*(s) (optimal policy). Variants for Q-function.

Model-Based Learning. From data, estimate the transition probabilities, then compute the value function by dynamic programming. Bellman equation: Vπ(s) = Σ_a π(a|s) [ E(r|s,a) + γ Σ_s' P(s'|s,a) Vπ(s') ]. Developed for transition probabilities that are "nice": discrete, Gaussian, Poisson, ... Grid example.
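A minimal sketch of policy evaluation by dynamic programming: repeatedly apply the Bellman equation as an update until the values stop changing. The two-state MDP and the uniform random policy below are invented for illustration.

# Iterative policy evaluation: sweep the Bellman equation until convergence.
# The MDP (states, actions, P, R) and the policy are made up for illustration.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "go"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform random policy π(a|s)
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(1000):
    # One Bellman backup per state: V(s) <- Σ_a π(a|s) [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]
    V_new = {s: sum(pi[s][a] * (R[(s, a)] +
                    gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                    for a in actions)
             for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < 1e-8:  # stop when the values have converged
        break
print(V)  # Vπ(s) for each state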

Model-Free Learning. Bypass estimating the transition probabilities. Why? Continuous state variables may have no "nice" functional form. (How about using an LSTM/RNN dynamics model, i.e. deep dynamic programming?)

Model-Free Learning. Directly learn the optimal policy π* (policy iteration). Directly learn the optimal value function V*. Directly learn the optimal action-value function Q*. All of these functions can be implemented in a neural network: NN learning = reinforcement learning.
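A tiny sketch of representing the Q-function as a neural network, here using PyTorch; the state dimension, number of actions, and layer sizes are arbitrary assumptions, and no training loop is shown.

# A Q-function implemented as a small neural network (PyTorch).
# State dimension, action count, and hidden size are arbitrary assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
s = torch.randn(1, 4)                  # a placeholder state vector
greedy_a = q(s).argmax(dim=1).item()   # greedy action = argmax over predicted Q-values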

Model-Free Learning: what are the data? The data are simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),..., which does not tell us expected values or optimal actions. Monte Carlo learning: to learn V, observe the return at the end of each episode. E.g. ChessBase gives the percentage of wins by White for any position.
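A sketch of Monte Carlo prediction: for each visited state, average the returns observed from that point to the end of the episode. The episodes below are fabricated (state, reward) sequences.

# Monte Carlo prediction: estimate V(s) by averaging observed returns.
# The episodes are fabricated lists of (state, reward) pairs.
from collections import defaultdict

def mc_values(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)   # "every-visit" Monte Carlo
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [[("s0", 0.0), ("s1", 1.0), ("s1", 2.0)],
            [("s0", 1.0), ("s1", 0.0)]]
print(mc_values(episodes))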

Temporal Difference Learning. Consistency idea: using the current model and the given data s(0),a(0),r(0),s(1),a(1),r(1),..., compare the value estimate V(s(t)) at the current state with the next-step value V1(s(t)) = r(t) + γV(s(t+1)). Minimize the "error" [V1(s(t)) - V(s(t))]².
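A sketch of the TD(0) prediction update this describes: nudge V(s(t)) toward the one-step target r(t) + γV(s(t+1)). The transitions and learning rate below are made up.

# TD(0) prediction: move V(s) toward the one-step target r + γ V(s').
# Transitions and the learning rate alpha are invented for illustration.
from collections import defaultdict

V = defaultdict(float)
alpha, gamma = 0.1, 0.9

transitions = [("s0", 1.0, "s1"), ("s1", 0.0, "s0"), ("s0", 1.0, "s1")]  # (s, r, s')
for s, r, s_next in transitions:
    target = r + gamma * V[s_next]      # next-step value V1(s(t))
    V[s] += alpha * (target - V[s])     # gradient step on the squared error [V1(s) - V(s)]²
print(dict(V))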

Model-Free Learning Example http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html