1
Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu
Google DeepMind; Montreal Institute for Learning Algorithms (MILA), University of Montreal
Published at ICML 2016
2
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
3
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
4
Introduction
Online RL algorithms combined with deep neural networks are fundamentally unstable: the sequence of observations encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. Solution idea: experience replay. Drawbacks of experience replay: it uses more memory and computation per real interaction, and it requires an off-policy learning algorithm that can update from data generated by an older policy.
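A minimal replay-buffer sketch (an illustration, not code from the paper) makes the two drawbacks concrete: the buffer stores many past transitions (extra memory), and a sampled batch was generated by older policies, so the learner must be able to update off-policy.

```python
# Minimal experience replay buffer (illustrative sketch, not the paper's code).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # holds up to `capacity` transitions in memory

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks temporal correlation, but the sampled
        # transitions were produced by whatever (older) policy was acting at the time.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for i in range(1000):
    buf.add(i, 0, 0.0, i + 1, False)           # dummy transitions
batch = buf.sample()
```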
5
Introduction
In this paper: instead of experience replay, multiple agents are executed asynchronously in parallel on multiple instances of the environment. Benefits: the agents' data is decorrelated into a more stationary process; the approach applies to both on-policy and off-policy RL algorithms; a multi-core CPU on a single machine replaces GPUs or massively distributed machines; and training takes far less time than GPU-based algorithms.
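A small numerical illustration (a toy sketch, not from the paper) of the decorrelation benefit: one agent's observation stream is strongly correlated in time, while interleaving the streams of 16 independently-seeded agents yields a far less correlated mixture.

```python
import numpy as np

def ar1_stream(rng, n, rho=0.95):
    # Stand-in for one agent's observations: a temporally correlated AR(1) process.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

single = ar1_stream(np.random.default_rng(0), 64_000)              # one agent, one stream
streams = [ar1_stream(np.random.default_rng(s), 4_000) for s in range(16)]
interleaved = np.stack(streams, axis=1).ravel()                    # round-robin over 16 agents

print("lag-1 autocorrelation, single agent:      ", round(lag1_autocorr(single), 3))
print("lag-1 autocorrelation, 16 parallel agents:", round(lag1_autocorr(interleaved), 3))
```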
6
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
7
Related Work: Gorila Architecture
Performs asynchronous training of reinforcement learning agents in a distributed setting. Each actor acts in its own copy of the environment and has a separate replay memory. Each learner samples data from a replay memory and computes gradients of the DQN loss with respect to the policy parameters.
8
Related Work: Gorila Architecture
Gradients are asynchronously sent to a central parameter server, which updates a central copy of the model. Setting: 100 separate actor-learners and 30 parameter server instances, 130 machines in total. Performance: outperformed DQN over the 49 Atari games and was about 20 times faster than DQN. The updated policy parameters are sent to the actor-learners at fixed intervals.
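A rough sketch (an assumption, not DeepMind's Gorila code) of the actor-learner / parameter-server split described above: learners push gradients to a central server, and actors pull fresh parameters at a fixed interval.

```python
import numpy as np

class ParameterServer:
    def __init__(self, n_params, lr=0.01):
        self.params = np.zeros(n_params)   # central copy of the model
        self.lr = lr

    def push_gradient(self, grad):
        # Called asynchronously by learners with gradients of the DQN loss.
        self.params -= self.lr * grad

    def pull_parameters(self):
        # Called by actor-learners at a fixed interval to refresh their local copy.
        return self.params.copy()

server = ParameterServer(n_params=4)
server.push_gradient(np.array([0.1, -0.2, 0.0, 0.3]))   # one learner's update
print(server.pull_parameters())                          # one actor refreshing its copy
```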
9
Related Work
Map Reduce framework for parallelizing batch reinforcement learning with linear function approximation: parallelism was used to speed up large matrix operations, not to parallelize the collection of experience or to stabilize learning. Parallel version of the Sarsa algorithm: multiple separate actor-learners to accelerate training; each learns separately and periodically sends updates of weights that have changed significantly to the other learners using peer-to-peer (P2P) communication.
10
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
11
RL Background: Q-learning and Sarsa
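The equations on this slide are not included in the transcript; as a reconstruction consistent with the paper's background section, the standard one-step losses are (θ⁻ denotes the periodically-updated target-network parameters):

```latex
% One-step Q-learning: regress Q(s,a;\theta) toward a bootstrapped target that
% maximizes over the next action, using target-network parameters \theta^{-}.
L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\big]

% One-step Sarsa: same squared loss, but the target uses the action a' actually
% taken in state s' by the current policy (on-policy).
L(\theta) = \mathbb{E}\big[\big(r + \gamma\, Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\big]
```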
12
RL Background: Actor-Critic
Actor-critic methods follow an approximate policy gradient, computed with an advantage function (the slide's equations are reproduced below).
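From the paper, the approximate policy gradient followed by the actor and the advantage estimate used by A3C (θ′ are the policy parameters, θ_v the value-function parameters, and k ≤ t_max the number of steps until the rollout ends):

```latex
% Approximate policy gradient, with an advantage estimate in place of the return:
\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v)

% Advantage estimate: k-step discounted return bootstrapped with the critic,
% minus the value baseline:
A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i}
  + \gamma^{k} V(s_{t+k}; \theta_v) - V(s_t; \theta_v)
```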
13
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
14
Asynchronous RL Framework
Figure: Framework of A3C (source: future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2). Each worker repeats the cycle: copy global network parameters → interact with its environment → accumulate gradients → update the global network with the gradients.
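A minimal runnable sketch of this worker cycle (a toy, not the paper's or the linked article's code): the "network" is a softmax policy with a value baseline over a hypothetical 2-armed bandit, so the whole loop fits in a few lines.

```python
import threading
import numpy as np

GAMMA, LR, T_MAX = 0.99, 0.05, 5
global_theta = np.zeros(2)     # global policy logits (one per action)
global_v = np.zeros(1)         # global state-value estimate (single toy state)
lock = threading.Lock()

def bandit_step(action, rng):
    # Hypothetical environment: action 1 pays ~1.0 on average, action 0 pays ~0.2.
    return rng.normal(1.0 if action == 1 else 0.2, 0.1)

def worker(seed, rollouts=500):
    rng = np.random.default_rng(seed)
    for _ in range(rollouts):
        with lock:                                   # 1. copy global network parameters
            theta, v = global_theta.copy(), global_v.copy()
        probs = np.exp(theta - theta.max()); probs /= probs.sum()
        d_theta, d_v = np.zeros_like(theta), np.zeros_like(v)
        actions, rewards = [], []
        for _ in range(T_MAX):                       # 2. worker interacts with environment
            a = int(rng.choice(2, p=probs))
            actions.append(a)
            rewards.append(bandit_step(a, rng))
        R = 0.0                                      # (no bootstrap term, for brevity)
        for a, r in zip(reversed(actions), reversed(rewards)):
            R = r + GAMMA * R
            adv = R - v[0]                           # advantage estimate
            grad_logpi = -probs.copy(); grad_logpi[a] += 1.0
            d_theta += grad_logpi * adv              # 3. accumulate policy gradient ...
            d_v += adv                               # ... and value update (push v toward R)
        with lock:                                   # 4. update global network with gradients
            global_theta += LR * d_theta
            global_v += 0.5 * LR * d_v

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("learned action probabilities:", np.exp(global_theta) / np.exp(global_theta).sum())
```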
15
Two main practical ideas
Similar to the Gorila framework, but: separate machines are replaced by multiple CPU threads on a single machine, which removes the communication costs of sending gradients and parameters. Multiple learners running in parallel explore different parts of the environment, maximizing diversity and making the updates less correlated in time (decorrelation), so no replay memory is used.
16
Asynchronous RL Framework
one-step Q-learning → Asynchronous one-step Q-learning
one-step Sarsa → Asynchronous one-step Sarsa
n-step Q-learning → Asynchronous n-step Q-learning
actor-critic → Asynchronous advantage actor-critic (A3C)
17
Algo: DQN vs. Asynchronous one-step Q-learning
Side-by-side comparison of the two algorithm listings: Deep Q-Network (DQN) and asynchronous one-step Q-learning.
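The asynchronous one-step Q-learning listing is not reproduced in the transcript; below is a toy, tabular sketch of one actor-learner thread (an illustration, not the paper's pseudocode verbatim): ε-greedy acting, a periodically refreshed target table standing in for θ⁻, and updates accumulated for several steps before being applied to the shared Q.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 5, 2                   # toy chain MDP: action 1 moves right, action 0 resets
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
I_TARGET, I_ASYNC_UPDATE, T_LIMIT = 100, 5, 20_000

Q = np.zeros((N_STATES, N_ACTIONS))          # shared parameters theta (here a table)
Q_target = Q.copy()                          # shared target parameters theta^-
T = [0]                                      # shared global step counter
lock = threading.Lock()

def env_step(s, a):
    # Hypothetical chain environment: small penalty for resetting, reward 1 at the end.
    if a == 0:
        return 0, -0.01, False
    s2 = s + 1
    if s2 == N_STATES - 1:
        return 0, 1.0, True                  # reaching the last state is terminal
    return s2, 0.0, False

def actor_learner(seed):
    global Q_target
    rng = np.random.default_rng(seed)
    s, t, dQ = 0, 0, np.zeros_like(Q)
    while True:
        with lock:
            if T[0] >= T_LIMIT:
                return
            T[0] += 1
            q_s, q_tgt = Q[s].copy(), Q_target.copy()
            if T[0] % I_TARGET == 0:
                Q_target = Q.copy()          # theta^- <- theta
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(q_s.argmax())
        s2, r, done = env_step(s, a)
        y = r if done else r + GAMMA * q_tgt[s2].max()
        dQ[s, a] += y - q_s[a]               # accumulate gradient of the squared TD error
        s, t = (0 if done else s2), t + 1
        if t % I_ASYNC_UPDATE == 0 or done:  # apply the accumulated update, then clear it
            with lock:
                Q += ALPHA * dQ
            dQ[:] = 0.0

threads = [threading.Thread(target=actor_learner, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("greedy action per state:", Q.argmax(axis=1))   # expect action 1 in states 0-3
```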
18
Algo: Asynchronous Advantage Actor-Critic
An entropy regularization term H can be added to the policy gradient to improve exploration by discouraging premature convergence to suboptimal deterministic policies; the full gradient becomes $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v) + \beta\, \nabla_{\theta'} H(\pi(s_t; \theta'))$.
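A tiny numpy illustration (an assumption, not the paper's code) of how the entropy bonus enters the update for a softmax policy parameterized by logits: H(π) = −Σ_a π(a) log π(a), and β times its gradient is simply added to the policy-gradient direction.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def policy_update_direction(logits, action, advantage, beta=0.01):
    pi = softmax(logits)
    onehot = np.eye(len(logits))[action]
    grad_logpi = onehot - pi                               # d/dlogits of log pi(action)
    # d/dlogits of the entropy H(pi) for a softmax parameterization:
    grad_entropy = -pi * (np.log(pi) - np.dot(pi, np.log(pi)))
    return grad_logpi * advantage + beta * grad_entropy

print(policy_update_direction(np.array([0.5, -0.5, 0.0]), action=1, advantage=2.0))
```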
19
Optimization
Three methods were compared: SGD with momentum, RMSProp without shared statistics, and RMSProp with shared statistics (the most robust), i.e., RMSProp where the statistics g are shared across threads.
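For reference, the RMSProp update considered in the paper, where Δθ is the accumulated gradient and g a moving average of its elementwise square; in the "shared" variant, g is shared across threads rather than kept per thread:

```latex
g \leftarrow \alpha\, g + (1 - \alpha)\, \Delta\theta^{2}
\qquad\qquad
\theta \leftarrow \theta - \eta\, \frac{\Delta\theta}{\sqrt{g + \epsilon}}
```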
20
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
21
Experiments
Atari 2600 games (experiments on 57 games), the TORCS car racing simulator (3D game), the MuJoCo physics simulator (continuous action control), and Labyrinth (3D maze game).
22
Experimental setup (A3C on Atari and TORCS)
number of threads: 16 (on a single machine, no GPUs)
updates every 5 actions (t_max = 5 and I_Update = 5)
optimization: shared RMSProp
network architecture: 2 convolutional layers and 1 fully connected layer (each followed by a ReLU), with input preprocessing and architecture as in (Mnih et al., 2015; 2013)
learning rate: sampled from a LogUniform(10^-4, 10^-2) distribution
discount factor γ = 0.99, RMSProp decay factor α = 0.99, entropy weight β = 0.01
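A small sketch of the learning-rate sampling described above (not the authors' tuning code): drawing from LogUniform(10^-4, 10^-2) means the exponent is uniform, so each order of magnitude is sampled equally often.

```python
import numpy as np

def sample_learning_rate(rng, low=1e-4, high=1e-2):
    # LogUniform(low, high): sample uniformly in log-space, then exponentiate back.
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)
print([f"{sample_learning_rate(rng):.5f}" for _ in range(5)])
```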
23
Learning speed comparison (5 Atari games)
DQN: trained on an Nvidia K40 GPU. Asynchronous methods: trained on 16 CPU cores.
24
Score result on 57 Atari games
All hyperparameters were fixed across the 57 games. A3C, LSTM: a variant that adds 256 LSTM cells after the final hidden layer (for additional comparison). Mean and median refer to human-normalized scores over the 57 Atari games.
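The human-normalized score behind these mean/median summaries is not defined on the slide; the usual convention in this line of work (an assumption here) is:

```latex
\text{normalized score} =
  \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}
       {\text{score}_{\text{human}} - \text{score}_{\text{random}}} \times 100\%
```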
25
A3C on other environments
Results are shown for the TORCS 3D car racing game, the MuJoCo physics simulator, and Labyrinth.
26
Scalability and Data Efficiency
The speedups are averaged over 7 Atari games.
27
Robustness and Stability
Trained with 50 different learning rates and random initializations. Result: the methods are robust to the choice of learning rate and random initialization, and training is stable, neither collapsing nor diverging once learning is under way (all 4 asynchronous methods show the same behavior).
28
Comparison of three optimization methods
50 experiments on n-step Q-learning and A3C with 50 different random learning rates and initializations, comparing momentum SGD, RMSProp, and shared RMSProp.
29
Outline: Introduction, Related Work, Reinforcement Learning Background, Asynchronous RL Framework, Experiments, Conclusions and Discussion
30
Conclusions and Discussion
Within this framework, stable training of neural networks is possible in many settings (value-based or policy-based, on-policy or off-policy, discrete or continuous actions). Training time is greatly reduced. The framework could potentially be improved by using other ways of estimating the advantage function. A number of complementary improvements to the neural network architecture are also possible.
31
Q & A