Reinforcement Learning & Apprenticeship Learning Chenyi Chen

Markov Decision Process (MDP) What’s an MDP? A sequential decision problem in a fully observable, stochastic environment. Markovian transition model: the nth state is determined only by the (n-1)th state and the (n-1)th action. Each state has a reward, and rewards are additive.

Markov Decision Process (MDP) State s: a representation of the current environment;

Markov Decision Process (MDP) Example: Tom and Jerry, control Jerry (Jerry’s perspective) State: the positions of Tom and Jerry, 25*25=625 states in total; (Figure: one of the states)

Markov Decision Process (MDP) State s: a representation of the current environment; Action a: the action the agent can take in state s;

Markov Decision Process (MDP) Example: Tom and Jerry, control Jerry (Jerry’s perspective) State: the positions of Tom and Jerry, 25*25=625 states in total; Action: both can move to any of the 8 neighboring squares or stay; (Figure: one of the states)

Markov Decision Process (MDP) State s: a representation of the current environment; Action a: the action the agent can take in state s; Reward R(s): the reward of the current state s (+, -, 0); Value (aka utility) of state s: different from the reward; it depends on future optimal actions;

A Straightforward Example You get 100 bucks if you come to class, so the reward of “come to class” is 100. You can use the money to: eat food (you only have 50 bucks left), or invest in the stock market (you earn 1000 bucks, including the invested 100 bucks). The value (utility) of “come to class” is 1000.

Markov Decision Process (MDP) Example: Tom and Jerry, control Jerry (Jerry’s perspective) State: the positions of Tom and Jerry, 25*25=625 states in total; Action: both can move to any of the 8 neighboring squares or stay; Reward: 1) Jerry and the cheese in the same square, +5; 2) Tom and Jerry in the same square, -20; 3) otherwise 0; (Figure: one of the states)

Markov Decision Process (MDP) State s: a representation of the current environment; Action a: the action the agent can take in state s; Reward R(s): the reward of the current state s (+, -, 0); Value (aka utility) of state s: different from the reward; it depends on future optimal actions; Transition probability P(s’|s,a): given that the agent is in state s and takes action a, the probability of reaching state s’ in the next step;

Markov Decision Process (MDP) Example: Tom and Jerry, control Jerry (Jerry’s perspective) State: the positions of Tom and Jerry, 25*25=625 states in total; Action: both can move to any of the 8 neighboring squares or stay; Reward: 1) Jerry and the cheese in the same square, +5; 2) Tom and Jerry in the same square, -20; 3) otherwise 0; Transition probability: determined by Tom’s movement pattern. (Figure: one of the states)

Markov Decision Process (MDP) Example: Tom and Jerry, control Jerry (Jerry’s perspective) …

Markov Decision Process (MDP) State s: a representation of the current environment; Action a: the action the agent can take in state s; Reward R(s): the reward of the current state s (+, -, 0); Value (aka utility) of state s: different from the reward; it depends on future optimal actions; Transition probability P(s’|s,a): given that the agent is in state s and takes action a, the probability of reaching state s’ in the next step; Policy π(s)->a: a mapping from states to actions; given state s, it outputs the action a that should be taken.
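To make these ingredients concrete, here is a minimal sketch (not from the slides) of how the MDP pieces for a small grid world like the Tom-and-Jerry example could be represented in Python; all names and numbers are hypothetical placeholders.

```python
import random

class MDP:
    """Minimal MDP container mirroring the definitions above (hypothetical)."""

    def __init__(self, states, actions, reward, transition, gamma=0.9):
        self.states = states          # list of states s
        self.actions = actions        # list of actions a
        self.reward = reward          # dict: s -> R(s)
        self.transition = transition  # dict: (s, a) -> {s_next: P(s_next | s, a)}
        self.gamma = gamma            # discount factor used for the value/utility

    def sample_next_state(self, s, a):
        """Draw the next state according to P(s'|s,a)."""
        dist = self.transition[(s, a)]
        next_states = list(dist.keys())
        probs = list(dist.values())
        return random.choices(next_states, weights=probs, k=1)[0]
```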

Bellman Equation
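For reference, the standard Bellman optimality equation relating the value (utility) of a state to the values of its successors is

V(s) = R(s) + γ · max_a Σ_{s’} P(s’|s,a) · V(s’)

i.e., the value of a state is its immediate reward plus the discounted value obtained by acting optimally afterwards.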

Value Iteration
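A minimal value-iteration sketch in Python, assuming the hypothetical MDP container from the earlier sketch; it repeatedly applies the Bellman update until the values stop changing.

```python
def value_iteration(mdp, tol=1e-6):
    """Compute V(s) for every state by repeated Bellman backups.

    `mdp` is assumed to expose states, actions, reward, transition and gamma,
    as in the hypothetical MDP sketch above.
    """
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            # Expected discounted value of each action, then take the best one.
            best_q = max(
                sum(p * V[s2] for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.actions
            )
            new_v = mdp.reward[s] + mdp.gamma * best_q
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```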

Reinforcement Learning Similar to MDPs, but we assume the environment model (the transition probability P(s’|s,a)) is unknown.

Reinforcement Learning How to solve it? Solution #1: use the Monte Carlo method to sample the transition probability, then run Value Iteration. Limitation: too slow for problems with many possible states, because it ignores how frequently states are visited.

Monte Carlo Method A broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; Typically one runs simulations many times in order to obtain the distribution of an unknown probabilistic entity. From Wikipedia

Monte Carlo Example
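As an illustration of Solution #1 above, this hedged sketch estimates the transition probabilities P(s’|s,a) by repeated random sampling; the environment interface `env_step` is a hypothetical stand-in for the unknown model.

```python
from collections import Counter, defaultdict

def estimate_transitions(env_step, states, actions, n_samples=1000):
    """Estimate P(s'|s,a) by sampling the environment many times.

    `env_step(s, a)` is a hypothetical function returning a next state
    drawn from the true (unknown) transition distribution.
    """
    counts = defaultdict(Counter)
    for s in states:
        for a in actions:
            for _ in range(n_samples):
                counts[(s, a)][env_step(s, a)] += 1
    # Normalize the counts into estimated probabilities.
    return {
        sa: {s2: c / n_samples for s2, c in counter.items()}
        for sa, counter in counts.items()
    }
```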

Reinforcement Learning How to solve it? Solution #1: use the Monte Carlo method to sample the transition probability, then run Value Iteration. Limitation: too slow for problems with many possible states, because it ignores how frequently states are visited. Solution #2: Q-learning, the major algorithm for reinforcement learning.

Q-learning
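The standard tabular Q-learning update, which needs no transition model, is Q(s,a) ← Q(s,a) + α·(r + γ·max_{a’} Q(s’,a’) - Q(s,a)). A minimal sketch of one learning episode (environment interface hypothetical):

```python
import random

def q_learning_episode(env_reset, env_step, actions, Q,
                       alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=1000):
    """Run one episode of tabular Q-learning.

    `env_reset()` returns an initial state; `env_step(s, a)` returns
    (next_state, reward, done). Both are hypothetical interfaces.
    Q is a dict mapping (state, action) pairs to estimated values.
    """
    s = env_reset()
    for _ in range(max_steps):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q.get((s, act), 0.0))
        s2, r, done = env_step(s, a)
        # Move Q(s,a) toward the bootstrapped target r + gamma * max_a' Q(s',a').
        best_next = 0.0 if done else max(Q.get((s2, act), 0.0) for act in actions)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
        s = s2
        if done:
            break
    return Q
```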

Playing Atari with Deep Reinforcement Learning The Atari 2600 is a video game console released in September 1977 by Atari, Inc. Atari emulator: Arcade Learning Environment (ALE)

What did they do? Train a deep convolutional neural network. Input: the current state (a sequence of raw images). Output: the Q(s,a) value for every legal action. Then let the CNN play Atari games.

What’s Special? The input is raw images! The output is the action! Game independent: the same convolutional neural network is used for all games. It outperforms human expert players in some games.

Problem Definition

A Variant of Q-learning In the paper:

Deep Learning Approach Approximate the Q-value with a convolutional neural network: Q(s,a;θ). (Figure: a straightforward structure vs. the structure used in the paper)

How to Train the Convolutional Neural Network? Loss function and Q-value target (equations on the slide):
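In the paper’s notation, the network parameters θ are trained by minimizing a squared error between the network output and a bootstrapped target, roughly L(θ) = E[(y - Q(s,a;θ))^2], with target y = r + γ·max_{a’} Q(s’,a’;θ_old) for non-terminal transitions and y = r for terminal ones (θ_old denotes the parameters from the previous iteration).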

Some Details The distribution of action a (ε-greedy policy): choose the “best” action with probability 1-ε, and select a random action with probability ε, where ε is annealed linearly from 1 to 0.1. Input image preprocessing function φ(s_t). Build a huge database (experience replay) to store historical samples.
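A minimal sketch of the ε-greedy choice and the linear annealing schedule described above; `q_values(state)` is a hypothetical function returning the network’s Q(s,a) estimate for every legal action, and the annealing length is a placeholder.

```python
import random

def epsilon_greedy_action(q_values, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    values = q_values(state)  # hypothetical: dict mapping action -> Q(s, a)
    return max(actions, key=lambda a: values[a])

def annealed_epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```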

During Training… At each step: play the game for one step; do mini-batch gradient descent on the parameters θ for one step; add the new data sample to the database.

CNN Training Pipeline
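As a rough sketch of the loop described on the previous slide (interfaces and names hypothetical, many details of the paper omitted):

```python
import random
from collections import deque

def train_dqn(env_reset, env_step, actions, q_net, preprocess,
              num_steps=100_000, batch_size=32, gamma=0.99, memory_size=50_000):
    """Simplified DQN-style training loop.

    `q_net` is assumed to expose q_values(state) -> dict(action -> Q) and
    train_on_batch(batch, gamma) for one gradient step; `env_step(a)` returns
    (frame, reward, done). All of these interfaces are hypothetical.
    """
    replay = deque(maxlen=memory_size)  # database of historical samples
    s = preprocess(env_reset())
    for step in range(num_steps):
        # 1. Play the game for one step with an epsilon-greedy policy.
        eps = max(0.1, 1.0 - 0.9 * step / 100_000)  # placeholder anneal schedule
        if random.random() < eps:
            a = random.choice(actions)
        else:
            values = q_net.q_values(s)
            a = max(actions, key=lambda act: values[act])
        frame, r, done = env_step(a)
        s2 = preprocess(frame)
        # 2. Add the new data sample to the database.
        replay.append((s, a, r, s2, done))
        # 3. One step of mini-batch gradient descent on theta.
        if len(replay) >= batch_size:
            batch = random.sample(list(replay), batch_size)
            q_net.train_on_batch(batch, gamma)
        s = preprocess(env_reset()) if done else s2
    return q_net
```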

After Training… Play the game

Results Screenshots from five Atari 2600 games (left to right): Beam Rider, Breakout, Pong, Seaquest, Space Invaders. Comparison of average total reward for various learning methods, obtained by running an ε-greedy policy with ε = 0.05 for a fixed number of steps.

Results The leftmost plot shows the predicted value function for a 30-frame segment of the game Seaquest. The three screenshots correspond to the frames labeled A, B, and C, respectively.

Apprenticeship Learning via Inverse Reinforcement Learning Teach the computer to do something by demonstration, rather than by telling it the rules or the reward. Reinforcement Learning: tell the computer the reward and let it learn by itself using that reward. Apprenticeship Learning: demonstrate to the computer and let it mimic the demonstrated performance.

Why Apprenticeship Learning? For standard MDPs, a reward for each state needs to be specified. Specifying a reward is sometimes not easy: what’s the reward for driving? When teaching people to do something (e.g., driving), we usually prefer to demonstrate rather than tell them the reward function.

How Does It Work? The reward is unknown, but we assume it is a linear function of features: φ(s) is a function mapping state s to a feature vector, so R(s) = w·φ(s).

Example of a Feature State s_t of the red car is defined as: s_t = 1 left lane, s_t = 2 middle lane, s_t = 3 right lane. Feature φ(s_t) is defined as: [1 0 0] left lane, [0 1 0] middle lane, [0 0 1] right lane. w is defined as: w = [0.1 0.5 0.3], so R(left lane) = 0.1, R(middle lane) = 0.5, R(right lane) = 0.3. So in this case staying in the middle lane is preferred.
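A tiny numeric check of this example, assuming R(s) = w·φ(s):

```python
import numpy as np

# Features and weights from the lane example above.
phi = {"left": np.array([1, 0, 0]),
       "middle": np.array([0, 1, 0]),
       "right": np.array([0, 0, 1])}
w = np.array([0.1, 0.5, 0.3])

for lane, features in phi.items():
    print(lane, float(w @ features))  # left 0.1, middle 0.5, right 0.3
```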

How Does It Work? The reward is unknown, but we assume it is a linear function of features, R(s) = w·φ(s), where φ(s) maps state s to a feature vector. The value (utility) of policy π is then: V(π) = E[ Σ_t γ^t R(s_t) | π ] = w · E[ Σ_t γ^t φ(s_t) | π ].

How Does It Work? Define the feature expectation as μ(π) = E[ Σ_t γ^t φ(s_t) | π ]. Then V(π) = w·μ(π). Assume the expert’s demonstration defines the optimal policy π_E. We estimate the expert’s feature expectation μ_E by sampling m demonstration trajectories: μ_E ≈ (1/m) Σ_{i=1}^{m} Σ_t γ^t φ(s_t^(i)).

What Does Feature Expectation Look Like? State s_t of the red car is defined as: s_t = 1 left lane, s_t = 2 middle lane, s_t = 3 right lane. Feature φ(s_t) is defined as: [1 0 0] left lane, [0 1 0] middle lane, [0 0 1] right lane. During sampling, assume γ = 0.9. Step 1, red car in the middle lane: μ = 0.9^0*[0 1 0] = [0 1 0]. Step 2, red car still in the middle lane: μ = [0 1 0] + 0.9^1*[0 1 0] = [0 1.9 0]. Step 3, red car moves to the left lane: μ = [0 1.9 0] + 0.9^2*[1 0 0] = [0.81 1.9 0]. …
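The running sum above is just a discounted accumulation of feature vectors; this sketch reproduces the three steps of the example with γ = 0.9.

```python
import numpy as np

phi = {"left": np.array([1.0, 0.0, 0.0]),
       "middle": np.array([0.0, 1.0, 0.0]),
       "right": np.array([0.0, 0.0, 1.0])}
gamma = 0.9

# Lane occupied by the red car at steps 1, 2 and 3 of the example.
trajectory = ["middle", "middle", "left"]

mu = np.zeros(3)
for t, lane in enumerate(trajectory):
    mu += (gamma ** t) * phi[lane]
    print(f"step {t + 1}: mu = {mu}")
# step 1: [0, 1, 0]; step 2: [0, 1.9, 0]; step 3: [0.81, 1.9, 0]
```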

How Does It Work? We want to mimic the expert’s performance by minimizing the difference between μ(π) and μ_E. If we have ||μ(π) - μ_E||_2 ≤ ε, and assuming ||w||_2 ≤ 1, then |V(π_E) - V(π)| = |w·(μ_E - μ(π))| ≤ ||w||_2 · ||μ_E - μ(π)||_2 ≤ ε.

Pipeline
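In rough pseudocode, the pipeline alternates between a max-margin step (the SVM problem on the next slide) and an RL step that solves the MDP under the current reward guess. All helper functions here are hypothetical placeholders.

```python
def apprenticeship_learning(mu_expert, max_margin, solve_mdp_for_reward,
                            estimate_mu, initial_policy,
                            epsilon=0.01, max_iters=50):
    """Sketch of the apprenticeship-learning loop; all helpers are hypothetical.

    max_margin(mu_expert, mus)  -> (w, t), the SVM step on the next slide.
    solve_mdp_for_reward(w)     -> an (approximately) optimal policy for R(s) = w . phi(s).
    estimate_mu(policy)         -> that policy's feature expectation (e.g. by Monte Carlo).
    """
    policies = [initial_policy]
    mus = [estimate_mu(initial_policy)]
    for _ in range(max_iters):
        # How well can a reward weight vector separate the expert from our policies?
        w, t = max_margin(mu_expert, mus)
        if t <= epsilon:
            break  # some found policy (or mixture of them) is close enough to the expert
        policy = solve_mdp_for_reward(w)  # RL step under the current reward guess
        policies.append(policy)
        mus.append(estimate_mu(policy))
    return policies, mus
```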

Support Vector Machine (SVM) The 2nd step of the pipeline is an SVM problem, which can be rewritten as:
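In the paper, this max-margin step is, roughly,

max_{t,w} t   subject to   w^T μ_E ≥ w^T μ^(j) + t for every policy j found so far,   ||w||_2 ≤ 1,

which is equivalent to a support vector machine problem in which the expert’s feature expectation μ_E is labeled +1 and the feature expectations of the previously found policies are labeled -1.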

Pipeline

Their Testing System

Demo Videos
Driving Style | Expert | Learned Controller | Both (Expert left, Learned right)
1: Nice | expert1.avi | learnedcontroller1.avi | joined1.avi
2: Nasty | expert2.avi | learnedcontroller2.avi | joined2.avi
3: Right lane nice | expert3.avi | learnedcontroller3.avi | joined3.avi
4: Right lane nasty | expert4.avi | learnedcontroller4.avi | joined4.avi
5: Middle lane | expert5.avi | learnedcontroller5.avi | joined5.avi

Their Results

Questions?