Deep reinforcement learning

Function approximation
So far, we have assumed a lookup table representation for the utility function U(s) or the action-utility function Q(s,a)
This does not work if the state space is very large or continuous
Alternative idea: approximate the utilities or Q values using parametric functions and automatically learn the parameters:
U_θ(s) ≈ U(s),  Q_θ(s,a) ≈ Q(s,a)
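For instance (an illustration of mine, not from the slides), the simplest parametric form is a linear function of hand-designed state-action features, so we learn one weight per feature instead of one table entry per state-action pair:

```python
import numpy as np

def q_approx(theta, phi_sa):
    """Parametric Q value: a linear function of state-action features phi(s, a).

    theta:  learned parameter vector (what we fit instead of a lookup table)
    phi_sa: feature vector describing the state-action pair
    """
    return np.dot(theta, phi_sa)
```

Deep Q learning (next slide) replaces the hand-designed features and linear map with a neural network.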

Deep Q learning
Train a deep neural network to output Q values: given a state s, the network produces an estimate Q(s,a;θ) for every action a
Source: D. Silver
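A minimal sketch of such a Q network in PyTorch (my own illustration; the class name QNetwork and the layer sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per action (hypothetical sizes)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output head per action
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, num_actions)

# Usage: pick the greedy action for a single state
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)
```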

Deep Q learning
Regular TD update: "nudge" Q(s,a) towards the target:
Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
Deep Q learning: encourage the estimate to match the target by minimizing the squared error
L(θ) = ( r + γ max_a' Q(s',a'; θ) − Q(s,a; θ) )²
where r + γ max_a' Q(s',a'; θ) is the target and Q(s,a; θ) is the estimate
Compare to supervised learning, where we minimize ( y − f(x; θ) )² for a fixed label y
Key difference: the target in Q learning is also moving!

Online Q learning algorithm
Observe experience (s, a, r, s')
Compute target y = r + γ max_a' Q(s',a'; θ)
Update weights to reduce the error L(θ) = ½ ( y − Q(s,a; θ) )²
Gradient (treating the target y as a constant): ∇_θ L = ( Q(s,a; θ) − y ) ∇_θ Q(s,a; θ)
Weight update: θ ← θ − α ∇_θ L
This is called stochastic gradient descent (SGD)
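A sketch of this update step in PyTorch (my own illustration; it assumes a q_net such as the QNetwork above and an optimizer, e.g. torch.optim.SGD, over its parameters):

```python
import torch

def q_learning_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One online Q-learning SGD step (illustrative sketch, not the slides' exact code).

    s, s_next: batched state tensors; a: long tensor of action indices; r: float tensor of rewards.
    """
    with torch.no_grad():  # treat the target as a constant (semi-gradient)
        target = r + gamma * q_net(s_next).max(dim=1).values
    estimate = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    loss = ((target - estimate) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # theta <- theta - alpha * gradient
    return loss.item()
```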

Dealing with training instability
Challenges:
Target values are not fixed
Successive experiences are correlated and dependent on the policy
Policy may change rapidly with slight changes to parameters, leading to drastic changes in the data distribution
Solutions:
Freeze the target Q network
Use experience replay
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Experience replay
At each time step:
Take action a_t according to the epsilon-greedy policy
Store experience (s_t, a_t, r_{t+1}, s_{t+1}) in the replay memory buffer
Randomly sample a mini-batch of experiences from the buffer
Perform an update to reduce the objective function
E_{(s,a,r,s') ~ buffer} [ ( r + γ max_a' Q(s',a'; θ⁻) − Q(s,a; θ) )² ]
Keep the parameters θ⁻ of the target network fixed, updating them only every once in a while
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
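A sketch of how the replay buffer and the frozen target network fit together (my own illustration, not code from the paper; buffer capacity, batch size, and all names are assumptions):

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """Stores experiences (s, a, r, s_next) and samples random mini-batches.

    States are stored as tensors; actions and rewards as Python numbers.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = zip(*batch)
        return (torch.stack(s), torch.tensor(a),
                torch.tensor(r, dtype=torch.float32), torch.stack(s_next))

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared error against a target computed with the frozen target network."""
    s, a, r, s_next = batch
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    estimate = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((target - estimate) ** 2).mean()

# Initialization and periodic synchronization of the target network:
#   target_net = copy.deepcopy(q_net)
#   ... every C steps: target_net.load_state_dict(q_net.state_dict())
```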

Deep Q learning in Atari
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Deep Q learning in Atari
End-to-end learning of Q(s,a) from pixels s
Output is Q(s,a) for 18 joystick/button configurations: Q(s,a_1), Q(s,a_2), ..., Q(s,a_18)
Reward is the change in score for that step
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Deep Q learning in Atari
Input state s is a stack of raw pixels from the last 4 frames
Network architecture and hyperparameters fixed for all games
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
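A convolutional Q network in the spirit of the Nature DQN (a sketch of mine; the layer sizes follow my reading of the paper and should be treated as approximate):

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Convolutional Q network in the spirit of the Nature DQN (sizes approximate)."""
    def __init__(self, num_actions=18, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),         # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),         # 9x9 -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q value per joystick/button configuration
        )

    def forward(self, frames):  # frames: (batch, 4, 84, 84) stacked pixel frames
        return self.head(self.features(frames))
```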

Deep Q learning in Atari
The performance of DQN is normalized with respect to a professional human games tester (100% level) and random play (0% level)
The normalized performance, expressed as a percentage, is calculated as 100 × (DQN score − random play score) / (human score − random play score)
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
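In code, the normalization is just the following arithmetic (a trivial helper of my own; the example numbers are made up):

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Human-normalized score in percent: 0% = random play, 100% = human tester."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Example: an agent scoring 400 on a game where random play scores 10 and the
# human tester scores 300 is at 100 * (400 - 10) / (300 - 10) ≈ 134% of human level.
```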

Breakout demo https://www.youtube.com/watch?v=TmPfTpjtdgg

Policy gradient methods
Learning the policy directly can be much simpler than learning Q values
We can train a neural network to output stochastic policies, i.e., probabilities of taking each action in a given state
Softmax policy: π_θ(a|s) = exp(f_θ(s,a)) / Σ_a' exp(f_θ(s,a')), where f_θ(s,a) are learned scores
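A sketch of such a policy network (my own illustration; the name PolicyNetwork and the sizes are assumptions):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Outputs a softmax distribution over actions for each state (illustrative sketch)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.scores = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # unnormalized scores f_theta(s, a)
        )

    def forward(self, state):
        return torch.softmax(self.scores(state), dim=-1)  # pi_theta(a | s)

# Sampling an action from the stochastic policy:
policy = PolicyNetwork(state_dim=4, num_actions=2)
probs = policy(torch.randn(1, 4))
action = torch.distributions.Categorical(probs=probs).sample()
```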

Actor-critic algorithm
Define the objective function as the total discounted reward: J(θ) = E[ Σ_t γ^t r_t ]
The gradient for a stochastic policy is given by ∇_θ J(θ) = E[ ∇_θ log π_θ(a|s) Q(s,a) ]
Actor network update: θ ← θ + α ∇_θ log π_θ(a|s) Q_w(s,a)
Critic network update: use Q learning (following the actor's policy) to learn Q_w(s,a)

Advantage actor-critic
The raw Q value is less meaningful than whether the reward is better or worse than what you expect to get
Introduce an advantage function that subtracts a baseline from all Q values: A(s,a) = Q(s,a) − V(s)
Estimate V using a value network
Advantage actor-critic update: θ ← θ + α ∇_θ log π_θ(a|s) A(s,a)
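A sketch of one advantage actor-critic update (my own illustration; it assumes the PolicyNetwork above, a value_net mapping states to scalars, and uses the TD error as the advantage estimate):

```python
import torch

def advantage_actor_critic_step(policy, value_net, policy_opt, value_opt,
                                s, a, r, s_next, gamma=0.99):
    """One actor-critic update with an advantage baseline (illustrative sketch)."""
    v_s = value_net(s).squeeze(-1)
    with torch.no_grad():
        v_next = value_net(s_next).squeeze(-1)
        advantage = r + gamma * v_next - v_s.detach()   # TD error as advantage estimate

    # Critic: regress V(s) towards the bootstrapped target r + gamma * V(s')
    value_loss = ((r + gamma * v_next - v_s) ** 2).mean()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Actor: ascend grad log pi(a|s) * advantage (gradient ascent via negated loss)
    log_prob = torch.log(policy(s).gather(1, a.unsqueeze(1)).squeeze(1))
    policy_loss = -(log_prob * advantage).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```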

Asynchronous advantage actor-critic (A3C)
(Diagram: agents 1 through n each gather their own experience and compute local updates, asynchronously updating the global parameters of the shared value and policy networks V, π)
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Asynchronous advantage actor-critic (A3C)
Mean and median human-normalized scores over 57 Atari games
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Asynchronous advantage actor-critic (A3C)
TORCS car racing simulation video
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Asynchronous advantage actor-critic (A3C)
Motor control tasks video
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Playing Go
Go is a known (and deterministic) environment
Therefore, learning to play Go involves solving a known MDP
Key challenges: huge state and action spaces, long sequences, sparse rewards

Review: AlphaGo
Policy network: initialized by supervised training on a large number of human games
Value network: trained to predict the outcome of the game based on self-play
Both networks are used to guide Monte Carlo tree search (MCTS)
D. Silver et al. Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 529, January 2016

AlphaGo Zero
A fancier architecture (deep residual networks)
No hand-crafted features at all
A single network to predict both value and policy
Train the network entirely by self-play, starting with random moves
Uses MCTS inside the reinforcement learning loop, not outside
D. Silver et al. Mastering the Game of Go without Human Knowledge, Nature 550, October 2017
https://deepmind.com/blog/alphago-zero-learning-scratch/

AlphaGo Zero
Given a position, the neural network outputs both move probabilities P and a value V (probability of winning)
In each position, MCTS is conducted to return refined move probabilities π and the game winner Z
Neural network parameters are updated to make P and V better match π and Z
Reminiscent of policy iteration: self-play with MCTS is policy evaluation; updating the network towards the MCTS output is policy improvement
D. Silver et al. Mastering the Game of Go without Human Knowledge, Nature 550, October 2017
https://deepmind.com/blog/alphago-zero-learning-scratch/
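The "make P and V better match π and Z" step corresponds, as I understand the paper, to minimizing a combined loss of the form (Z − V)² − π·log P plus weight regularization; a minimal sketch (function name and tensor shapes are my assumptions):

```python
import torch

def alphago_zero_loss(p_logits, v, pi, z, params=None, c=1e-4):
    """Combined value + policy loss: (Z - V)^2 - pi . log P  (+ optional L2 regularization).

    p_logits: network's raw move scores, shape (batch, num_moves)
    v:        network's predicted value in [-1, 1], shape (batch,)
    pi:       MCTS-refined move probabilities, shape (batch, num_moves)
    z:        game outcome from the current player's perspective, shape (batch,)
    """
    value_loss = ((z - v) ** 2).mean()
    policy_loss = -(pi * torch.log_softmax(p_logits, dim=1)).sum(dim=1).mean()
    reg = c * sum((w ** 2).sum() for w in params) if params is not None else 0.0
    return value_loss + policy_loss + reg
```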

AlphaGo Zero
D. Silver et al. Mastering the Game of Go without Human Knowledge, Nature 550, October 2017
https://deepmind.com/blog/alphago-zero-learning-scratch/

https://deepmind.com/blog/alphago-zero-learning-scratch/ AlphaGo Zero It’s also more efficient than older engines! The thermal design power (TDP), sometimes called thermal design point, is the maximum amount of heat generated by a computer chip or component (often the CPU or GPU) that the cooling system in a computer is designed to dissipate under any workload. D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017 https://deepmind.com/blog/alphago-zero-learning-scratch/

Imitation learning
In some applications, you cannot bootstrap yourself from random policies:
High-dimensional state and action spaces where most random trajectories fail miserably
Expensive to evaluate policies in the physical world, especially in cases of failure
Solution: learn to imitate sample trajectories or demonstrations
This is also helpful when there is no natural reward formulation
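The simplest instantiation of this idea is behavior cloning: supervised training of a policy on expert (state, action) pairs. The sketch below is my own illustration of that baseline, not the method of the visuomotor work on the next slides, which uses a more sophisticated guided scheme:

```python
import torch
import torch.nn as nn

def behavior_cloning(policy, demos, epochs=10, lr=1e-3):
    """Simplest imitation learning: fit the policy to expert demonstrations by supervised learning.

    demos: list of (state_tensor, expert_action_index) pairs;
    policy: maps states to action probabilities (e.g. the PolicyNetwork sketched earlier).
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()
    states = torch.stack([s for s, _ in demos])
    actions = torch.tensor([a for _, a in demos])
    for _ in range(epochs):
        log_probs = torch.log(policy(states))   # log pi(a | s) for all actions
        loss = loss_fn(log_probs, actions)      # maximize log-likelihood of the expert's actions
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return policy
```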

Learning visuomotor policies
Underlying state x: true object position, robot configuration
Observations o: image pixels
Two-part approach:
Learn a guiding policy π(a|x) using trajectory-centric RL and control techniques
Learn a visuomotor policy π(a|o) by imitating π(a|x)
S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Learning visuomotor policies
Neural network architecture
S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Learning visuomotor policies
Overview video, training video
S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Summary
Deep Q learning
Policy gradient methods: actor-critic, advantage actor-critic, A3C
Policy iteration for AlphaGo
Imitation learning for visuomotor policies