Policy Gradient Methods

Presentation on theme: "Policy Gradient Methods"— Presentation transcript:

1 Policy Gradient Methods
Sources: Stanford CS 231n, Berkeley Deep RL course, David Silver's RL course

2 Policy Gradient Methods
Instead of indirectly representing the policy using Q-values, it can be more efficient to parameterize and learn the policy directly. This is especially true in large or continuous action spaces. (Image source: OpenAI Gym)

3 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$

4 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$
Why stochastic policies? There are even grid-world scenarios where only a stochastic policy can reach optimality: the agent can't tell the difference between the (aliased) gray cells, so a deterministic policy must choose the same action in both and can get stuck, while a policy that randomizes its action in those cells does not. (Source: D. Silver)

5 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$
Why stochastic policies? It's mathematically convenient!
Softmax policy: $\pi_\theta(s,a) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$
Gaussian policy (for continuous action spaces): $\pi_\theta(s,a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(a - f_\theta(s))^2}{2\sigma^2}\right)$
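For concreteness, here is a minimal NumPy sketch of sampling from both policy classes, assuming a linear score/mean $f_\theta$ over state features; the feature dimension, parameter shapes, and fixed $\sigma$ are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, phi):
    """pi_theta(s, a) over a discrete action set: softmax of scores f_theta(s, a) = theta_a . phi(s)."""
    scores = theta @ phi                 # one score per action
    scores = scores - scores.max()       # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def sample_softmax_action(theta, phi):
    probs = softmax_policy(theta, phi)
    return rng.choice(len(probs), p=probs)

def sample_gaussian_action(theta, phi, sigma=0.5):
    """pi_theta(s, a) = N(a; f_theta(s), sigma^2) for a 1-D continuous action."""
    mean = theta @ phi                   # f_theta(s)
    return rng.normal(mean, sigma)

# toy usage: 4-dim state features, 3 discrete actions / 1-D continuous action
phi = rng.normal(size=4)
a_disc = sample_softmax_action(rng.normal(size=(3, 4)), phi)
a_cont = sample_gaussian_action(rng.normal(size=4), phi)
```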

6 Expected value of a policy
$J(\theta) = \mathbb{E}\!\left[\left.\sum_{t \ge 0} \gamma^t r_t \,\right|\, \pi_\theta\right] = \mathbb{E}_\tau\!\left[r(\tau)\right] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
This is the expectation of the return over trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$, where $p(\tau;\theta)$ is the probability of trajectory $\tau$ under the policy with parameters $\theta$.

7 Finding the policy gradient
$J(\theta) = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau = \int_\tau r(\tau)\, p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)}\, d\tau = \int_\tau r(\tau)\, p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta)\, d\tau = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$
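The payoff of the last step is that the gradient is now itself an expectation under $p(\tau;\theta)$, so it can be estimated from samples alone. As a toy numerical check of the identity (my own example, not from the slides): for $p(x;\theta)=\mathcal{N}(x;\theta,1)$ and $r(x)=x^2$ we have $\mathbb{E}[r]=\theta^2+1$, so the true gradient is $2\theta$, and the sample mean of $r(x)\,\nabla_\theta \log p(x;\theta) = x^2(x-\theta)$ recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)   # samples from p(x; theta) = N(theta, 1)

# score-function (likelihood-ratio) estimate of d/dtheta E[x^2]
grad_estimate = np.mean(x**2 * (x - theta))  # r(x) * grad_theta log p(x; theta)
print(grad_estimate, 2 * theta)              # both close to 3.0
```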

8 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$
Probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$: $p(\tau;\theta) = \prod_{t \ge 0} \pi_\theta(s_t, a_t)\, P(s_{t+1} \mid s_t, a_t)$
$\log p(\tau;\theta) = \sum_{t \ge 0} \left[ \log \pi_\theta(s_t, a_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$
$\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)$: the unknown transition probabilities do not depend on $\theta$ and drop out, leaving only the score function $\nabla_\theta \log \pi_\theta$.

9 Score function $\nabla_\theta \log \pi_\theta(s,a)$
For the softmax policy $\pi_\theta(s,a) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$:
$\nabla_\theta \log \pi_\theta(s,a) = \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(s,a')\, \nabla_\theta f_\theta(s,a')$
For the Gaussian policy $\pi_\theta(s,a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(a - f_\theta(s))^2}{2\sigma^2}\right)$:
$\nabla_\theta \log \pi_\theta(s,a) = \frac{a - f_\theta(s)}{\sigma^2}\, \nabla_\theta f_\theta(s)$ (the additive normalization constant in $\log \pi_\theta$ does not depend on $\theta$, so its gradient vanishes)
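A small NumPy sketch of these two score functions, assuming linear scores $f_\theta(s,a) = \theta_a \cdot \phi(s)$ for the softmax case and a linear mean $f_\theta(s) = \theta \cdot \phi(s)$ with fixed $\sigma$ for the Gaussian case (these functional forms are assumptions made for illustration):

```python
import numpy as np

def softmax_probs(theta, phi):
    scores = theta @ phi                      # f_theta(s, a) = theta_a . phi(s)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def softmax_score(theta, phi, a):
    """grad_theta log pi_theta(s, a) = grad f(s, a) - sum_a' pi(s, a') grad f(s, a')."""
    probs = softmax_probs(theta, phi)
    grad = np.zeros_like(theta)               # one row per action's parameters
    grad[a] = phi                             # gradient of the f_theta(s, a) term
    grad -= probs[:, None] * phi[None, :]     # minus the expected gradient over actions
    return grad

def gaussian_score(theta, phi, a, sigma=0.5):
    """grad_theta log pi_theta(s, a) = (a - f_theta(s)) / sigma^2 * grad f_theta(s)."""
    return (a - theta @ phi) / sigma**2 * phi
```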

10 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$, where $\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)$, so
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[\left(\sum_{t \ge 0} \gamma^t r_t\right)\left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\right]$
(the return of trajectory $\tau$ times the gradient of the log-likelihood of its actions under the current policy).
How do we estimate the gradient in practice?

11 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[\left(\sum_{t \ge 0} \gamma^t r_t\right)\left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\right]$
Stochastic approximation: sample $N$ trajectories $\tau_1, \ldots, \tau_N$ under the current policy and average:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=0}^{T_i} \gamma^t r_{i,t}\right)\left(\sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})\right)$

12 REINFORCE algorithm
1. Sample $N$ trajectories $\tau_i$ using the current policy $\pi_\theta$
2. Estimate the policy gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r(\tau_i) \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})$
3. Update the parameters by gradient ascent: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
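A compact PyTorch sketch of this loop for a discrete-action task is below. The network architecture, hyperparameters, and the classic Gym-style `reset()`/`step()` interface are assumptions, not part of the original algorithm description; newer Gymnasium environments return extra values and would need small changes.

```python
import torch
import torch.nn as nn

def reinforce(env, obs_dim, n_actions, n_iters=50, batch_size=10, gamma=0.99, lr=1e-2):
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                           nn.Linear(64, n_actions))      # outputs f_theta(s, .)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(n_iters):
        batch_loss = 0.0
        for _ in range(batch_size):                       # 1. sample N trajectories
            obs, log_probs, rewards, done = env.reset(), [], [], False
            while not done:
                dist = torch.distributions.Categorical(
                    logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
                a = dist.sample()
                log_probs.append(dist.log_prob(a))
                obs, r, done, _ = env.step(a.item())
                rewards.append(r)
            ret = sum(gamma**t * r for t, r in enumerate(rewards))   # r(tau)
            # 2. minimizing -r(tau) * sum_t log pi(a_t | s_t) ascends the gradient estimate
            batch_loss = batch_loss - ret * torch.stack(log_probs).sum()
        opt.zero_grad()
        (batch_loss / batch_size).backward()
        opt.step()                                        # 3. gradient step on theta
    return policy
```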

13 REINFORCE: Single-step version
1. In state $s$, sample action $a$ using the current policy $\pi_\theta$ and observe reward $r$
2. Estimate the policy gradient: $\nabla_\theta J(\theta) \approx r\, \nabla_\theta \log \pi_\theta(s,a)$
3. Update the parameters by gradient ascent: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
What effect does this update have? It pushes up the probability of good actions and pushes down the probability of bad actions.
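To see this effect in isolation, here is a self-contained toy example (my own, not from the slides): a two-armed bandit where arm 1 always pays reward 1 and arm 0 pays 0. Repeating the single-step update drives the policy toward arm 1.

```python
import torch

# toy two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0
logits = torch.zeros(2, requires_grad=True)       # policy parameters theta
optimizer = torch.optim.SGD([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    r = 1.0 if a.item() == 1 else 0.0             # observed reward
    loss = -r * dist.log_prob(a)                  # ascend r * grad log pi(a | s)
    optimizer.zero_grad()
    loss.backward()                               # positive reward raises log pi(a)
    optimizer.step()

print(torch.softmax(logits, dim=0))               # probability of arm 1 approaches 1
```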

14 Reducing variance
Gradient estimate (for a single trajectory): $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
General problem: the returns of sampled trajectories are very noisy and lead to unreliable (high-variance) policy gradient estimates.

15 Reducing variance
Gradient estimate (for a single trajectory): $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
Observation: it seems bad to weight every action in a trajectory by the return of the entire trajectory. In particular, rewards obtained before an action was taken should not be used to weight that action.
Instead, for each action, consider only the cumulative future (discounted) reward:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$
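A small helper for computing these discounted future returns with one backward pass over a reward sequence (a sketch; the function name and example values are mine):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed in one backward pass."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```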

16 Reducing variance
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$ (the inner sum is the observed cumulative reward after taking action $a_t$ in state $s_t$)
But then, why not use the expected cumulative reward instead?
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$

17 Actor-Critic algorithm
Combine policy gradients and Q-learning by simultaneously training an actor (the policy) and a critic (the Q-function). (Source: D. Silver)

18 Reducing variance
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
Next observation: the raw Q-values are not the most useful signal. If all Q-values are positive, we will try to push up the probabilities of all the actions. Instead, introduce a baseline to measure whether an action's value is better or worse than what we expect to get in that state:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$
The term $A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$ is the advantage function.

19 Estimating the advantage function
$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[r + \gamma V^{\pi_\theta}(s') \mid s, a\right] - V^{\pi_\theta}(s) \approx r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$ (estimated from a single transition)
Therefore, it is sufficient to learn just the value function: $V^{\pi_\theta}(s) \approx V_w(s)$
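In code, the single-transition estimate is just the TD error of the learned value function. A minimal sketch (the function name and the `terminal` handling are my additions):

```python
def td_advantage(r, v_s, v_s_next, gamma=0.99, terminal=False):
    """One-sample advantage estimate: A(s, a) ~ r + gamma * V_w(s') - V_w(s)."""
    target = r if terminal else r + gamma * v_s_next
    return target - v_s

# e.g. r = 1.0, V_w(s) = 2.0, V_w(s') = 2.5, gamma = 0.9  ->  1.0 + 2.25 - 2.0 = 1.25
print(td_advantage(1.0, 2.0, 2.5, gamma=0.9))
```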

20 Online actor-critic algorithm
1. Sample action $a$ using the current policy, observe reward $r$ and successor state $s'$
2. Update $V_w(s)$ towards the target $r + \gamma V_w(s')$
3. Estimate $A^{\pi_\theta}(s,a) = r + \gamma V_w(s') - V_w(s)$
4. Estimate $\nabla_\theta J(\theta) = A^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$
5. Update the policy parameters: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
Source: Berkeley RL course
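A PyTorch sketch of one iteration of this loop with separate actor and critic networks; the network shapes, optimizers, and the classic Gym-style `step()` signature are assumptions. The actor maps a state to action logits and the critic maps it to a scalar value estimate.

```python
import torch
import torch.nn as nn

def actor_critic_step(actor, critic, actor_opt, critic_opt, env, obs, gamma=0.99):
    """One online actor-critic update; returns the next observation, or None if the episode ended."""
    s = torch.as_tensor(obs, dtype=torch.float32)
    dist = torch.distributions.Categorical(logits=actor(s))
    a = dist.sample()                                   # 1. sample action from pi_theta
    obs_next, r, done, _ = env.step(a.item())           # observe r and s'
    s_next = torch.as_tensor(obs_next, dtype=torch.float32)

    with torch.no_grad():
        target = r + (0.0 if done else gamma * critic(s_next).item())
        advantage = target - critic(s).item()           # 3. A ~ r + gamma V_w(s') - V_w(s)

    critic_loss = (critic(s).squeeze() - target) ** 2   # 2. move V_w(s) towards the target
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -advantage * dist.log_prob(a)          # 4-5. policy gradient step
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return None if done else obs_next
```

For example, `actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))` and `critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))`, each with its own `torch.optim.Adam` optimizer, would fit this interface (again, an illustrative choice).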

21 Asynchronous advantage actor-critic (A3C)
[Diagram: Agent 1 … Agent n each gather their own experience and compute local updates, asynchronously updating the shared global parameters of $V$ and $\pi$.]
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

22 Asynchronous advantage actor-critic (A3C)
Mean and median human-normalized scores over 57 Atari games. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

23 Asynchronous advantage actor-critic (A3C)
TORCS car racing simulation (video). Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

24 Asynchronous advantage actor-critic (A3C)
Motor control tasks (video). Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

25 Benchmarks and environments for Deep RL
OpenAI Gym

