Policy Gradient Methods

Presentation on theme: "Policy Gradient Methods"— Presentation transcript:

1 Policy Gradient Methods
Sources: Stanford CS 231n, Berkeley Deep RL course, David Silver's RL course

2 Policy Gradient Methods
Instead of indirectly representing the policy using Q-values, it can be more efficient to parameterize and learn the policy directly. This is especially true in large or continuous action spaces. (Image source: OpenAI Gym)

3 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$

4 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$
Why stochastic policies? There are even grid-world scenarios where only a stochastic policy can reach optimality: the agent can't tell the difference between the (aliased) gray cells, so a deterministic policy must choose the same action in both and can get stuck, while a policy that randomizes its action in those cells does not. (Source: D. Silver)

5 Stochastic policy representation
Learn a function giving the probability distribution over actions from the current state: $\pi_\theta(s,a) \approx P(a \mid s)$
Why stochastic policies? It's mathematically convenient!
Softmax policy: $\pi_\theta(s,a) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$
Gaussian policy (for continuous action spaces): $\pi_\theta(s,a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(a - f_\theta(s))^2}{2\sigma^2}\right)$
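For concreteness, here is a minimal NumPy sketch of sampling from both policy classes, assuming a linear score/mean $f_\theta$ over state features; the feature dimension, parameter shapes, and fixed $\sigma$ are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, phi):
    """pi_theta(s, a) over a discrete action set: softmax of scores f_theta(s, a) = theta_a . phi(s)."""
    scores = theta @ phi                 # one score per action
    scores = scores - scores.max()       # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def sample_softmax_action(theta, phi):
    probs = softmax_policy(theta, phi)
    return rng.choice(len(probs), p=probs)

def sample_gaussian_action(theta, phi, sigma=0.5):
    """pi_theta(s, a) = N(a; f_theta(s), sigma^2) for a 1-D continuous action."""
    mean = theta @ phi                   # f_theta(s)
    return rng.normal(mean, sigma)

# toy usage: 4-dim state features, 3 discrete actions / 1-D continuous action
phi = rng.normal(size=4)
a_disc = sample_softmax_action(rng.normal(size=(3, 4)), phi)
a_cont = sample_gaussian_action(rng.normal(size=4), phi)
```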

6 Expected value of a policy
$J(\theta) = \mathbb{E}\!\left[\left.\sum_{t \ge 0} \gamma^t r_t \,\right|\, \pi_\theta\right] = \mathbb{E}_\tau\!\left[r(\tau)\right] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
This is the expectation of the return over trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$, where $p(\tau;\theta)$ is the probability of trajectory $\tau$ under the policy with parameters $\theta$.

7 Finding the policy gradient
$J(\theta) = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau = \int_\tau r(\tau)\, p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)}\, d\tau = \int_\tau r(\tau)\, p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta)\, d\tau = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$
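The payoff of the last step is that the gradient is now itself an expectation under $p(\tau;\theta)$, so it can be estimated from samples alone. As a toy numerical check of the identity (my own example, not from the slides): for $p(x;\theta)=\mathcal{N}(x;\theta,1)$ and $r(x)=x^2$ we have $\mathbb{E}[r]=\theta^2+1$, so the true gradient is $2\theta$, and the sample mean of $r(x)\,\nabla_\theta \log p(x;\theta) = x^2(x-\theta)$ recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)   # samples from p(x; theta) = N(theta, 1)

# score-function (likelihood-ratio) estimate of d/dtheta E[x^2]
grad_estimate = np.mean(x**2 * (x - theta))  # r(x) * grad_theta log p(x; theta)
print(grad_estimate, 2 * theta)              # both close to 3.0
```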

8 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$
Probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$: $p(\tau;\theta) = \prod_{t \ge 0} \pi_\theta(s_t, a_t)\, P(s_{t+1} \mid s_t, a_t)$
$\log p(\tau;\theta) = \sum_{t \ge 0} \left[ \log \pi_\theta(s_t, a_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$
$\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)$: the unknown transition probabilities do not depend on $\theta$ and drop out, leaving only the score function $\nabla_\theta \log \pi_\theta$.

9 Score function $\nabla_\theta \log \pi_\theta(s,a)$
For the softmax policy $\pi_\theta(s,a) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$:
$\nabla_\theta \log \pi_\theta(s,a) = \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(s,a')\, \nabla_\theta f_\theta(s,a')$
For the Gaussian policy $\pi_\theta(s,a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(a - f_\theta(s))^2}{2\sigma^2}\right)$:
$\nabla_\theta \log \pi_\theta(s,a) = \frac{a - f_\theta(s)}{\sigma^2}\, \nabla_\theta f_\theta(s)$ (the additive normalization constant in $\log \pi_\theta$ does not depend on $\theta$, so its gradient vanishes)
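A small NumPy sketch of these two score functions, assuming linear scores $f_\theta(s,a) = \theta_a \cdot \phi(s)$ for the softmax case and a linear mean $f_\theta(s) = \theta \cdot \phi(s)$ with fixed $\sigma$ for the Gaussian case (these functional forms are assumptions made for illustration):

```python
import numpy as np

def softmax_probs(theta, phi):
    scores = theta @ phi                      # f_theta(s, a) = theta_a . phi(s)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def softmax_score(theta, phi, a):
    """grad_theta log pi_theta(s, a) = grad f(s, a) - sum_a' pi(s, a') grad f(s, a')."""
    probs = softmax_probs(theta, phi)
    grad = np.zeros_like(theta)               # one row per action's parameters
    grad[a] = phi                             # gradient of the f_theta(s, a) term
    grad -= probs[:, None] * phi[None, :]     # minus the expected gradient over actions
    return grad

def gaussian_score(theta, phi, a, sigma=0.5):
    """grad_theta log pi_theta(s, a) = (a - f_theta(s)) / sigma^2 * grad f_theta(s)."""
    return (a - theta @ phi) / sigma**2 * phi
```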

10 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$, where $\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)$, so
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[\left(\sum_{t \ge 0} \gamma^t r_t\right)\left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\right]$
(the return of trajectory $\tau$ times the gradient of the log-likelihood of its actions under the current policy).
How do we estimate the gradient in practice?

11 Finding the policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[\left(\sum_{t \ge 0} \gamma^t r_t\right)\left(\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(s_t, a_t)\right)\right]$
Stochastic approximation: sample $N$ trajectories $\tau_1, \ldots, \tau_N$ under the current policy and average:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=0}^{T_i} \gamma^t r_{i,t}\right)\left(\sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})\right)$

12 REINFORCE algorithm
1. Sample $N$ trajectories $\tau_i$ using the current policy $\pi_\theta$
2. Estimate the policy gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r(\tau_i) \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(s_{i,t}, a_{i,t})$
3. Update the parameters by gradient ascent: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
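A compact PyTorch sketch of this loop for a discrete-action task is below. The network architecture, hyperparameters, and the classic Gym-style `reset()`/`step()` interface are assumptions, not part of the original algorithm description; newer Gymnasium environments return extra values and would need small changes.

```python
import torch
import torch.nn as nn

def reinforce(env, obs_dim, n_actions, n_iters=50, batch_size=10, gamma=0.99, lr=1e-2):
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                           nn.Linear(64, n_actions))      # outputs f_theta(s, .)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(n_iters):
        batch_loss = 0.0
        for _ in range(batch_size):                       # 1. sample N trajectories
            obs, log_probs, rewards, done = env.reset(), [], [], False
            while not done:
                dist = torch.distributions.Categorical(
                    logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
                a = dist.sample()
                log_probs.append(dist.log_prob(a))
                obs, r, done, _ = env.step(a.item())
                rewards.append(r)
            ret = sum(gamma**t * r for t, r in enumerate(rewards))   # r(tau)
            # 2. minimizing -r(tau) * sum_t log pi(a_t | s_t) ascends the gradient estimate
            batch_loss = batch_loss - ret * torch.stack(log_probs).sum()
        opt.zero_grad()
        (batch_loss / batch_size).backward()
        opt.step()                                        # 3. gradient step on theta
    return policy
```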

13 REINFORCE: Single-step version
1. In state $s$, sample action $a$ using the current policy $\pi_\theta$ and observe reward $r$
2. Estimate the policy gradient: $\nabla_\theta J(\theta) \approx r\, \nabla_\theta \log \pi_\theta(s,a)$
3. Update the parameters by gradient ascent: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
What effect does this update have? It pushes up the probability of good actions and pushes down the probability of bad actions.
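To see this effect in isolation, here is a self-contained toy example (my own, not from the slides): a two-armed bandit where arm 1 always pays reward 1 and arm 0 pays 0. Repeating the single-step update drives the policy toward arm 1.

```python
import torch

# toy two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0
logits = torch.zeros(2, requires_grad=True)       # policy parameters theta
optimizer = torch.optim.SGD([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    r = 1.0 if a.item() == 1 else 0.0             # observed reward
    loss = -r * dist.log_prob(a)                  # ascend r * grad log pi(a | s)
    optimizer.zero_grad()
    loss.backward()                               # positive reward raises log pi(a)
    optimizer.step()

print(torch.softmax(logits, dim=0))               # probability of arm 1 approaches 1
```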

14 Reducing variance
Gradient estimate (for a single trajectory): $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
General problem: the returns of sampled trajectories are very noisy and lead to unreliable (high-variance) policy gradient estimates.

15 Reducing variance
Gradient estimate (for a single trajectory): $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
Observation: it seems bad to weight every action in a trajectory by the return of the entire trajectory. In particular, rewards obtained before an action was taken should not be used to weight that action.
Instead, for each action, consider only the cumulative future (discounted) reward:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$
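A small helper for computing these discounted future returns with one backward pass over a reward sequence (a sketch; the function name and example values are mine):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed in one backward pass."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```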

16 Reducing variance
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$ (the inner sum is the observed cumulative reward after taking action $a_t$ in state $s_t$)
But then, why not use the expected cumulative reward instead?
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$

17 Actor-Critic algorithm
Combine policy gradients and Q-learning by simultaneously training an actor (the policy) and a critic (the Q-function). (Source: D. Silver)

18 Reducing variance
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(s_t, a_t)$
Next observation: the raw Q-values are not the most useful signal. If all Q-values are positive, we will try to push up the probabilities of all the actions. Instead, introduce a baseline to measure whether an action's value is better or worse than what we expect to get in that state:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\right) \nabla_\theta \log \pi_\theta(s_t, a_t)$
The term $A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$ is the advantage function.

19 Estimating the advantage function
$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[r + \gamma V^{\pi_\theta}(s') \mid s, a\right] - V^{\pi_\theta}(s) \approx r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$ (estimated from a single transition)
Therefore, it is sufficient to learn just the value function: $V^{\pi_\theta}(s) \approx V_w(s)$
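In code, the single-transition estimate is just the TD error of the learned value function. A minimal sketch (the function name and the `terminal` handling are my additions):

```python
def td_advantage(r, v_s, v_s_next, gamma=0.99, terminal=False):
    """One-sample advantage estimate: A(s, a) ~ r + gamma * V_w(s') - V_w(s)."""
    target = r if terminal else r + gamma * v_s_next
    return target - v_s

# e.g. r = 1.0, V_w(s) = 2.0, V_w(s') = 2.5, gamma = 0.9  ->  1.0 + 2.25 - 2.0 = 1.25
print(td_advantage(1.0, 2.0, 2.5, gamma=0.9))
```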

20 Online actor-critic algorithm
1. Sample action $a$ using the current policy, observe reward $r$ and successor state $s'$
2. Update $V_w(s)$ towards the target $r + \gamma V_w(s')$
3. Estimate $A^{\pi_\theta}(s,a) = r + \gamma V_w(s') - V_w(s)$
4. Estimate $\nabla_\theta J(\theta) = A^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$
5. Update the policy parameters: $\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$
Source: Berkeley RL course
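A PyTorch sketch of one iteration of this loop with separate actor and critic networks; the network shapes, optimizers, and the classic Gym-style `step()` signature are assumptions. The actor maps a state to action logits and the critic maps it to a scalar value estimate.

```python
import torch
import torch.nn as nn

def actor_critic_step(actor, critic, actor_opt, critic_opt, env, obs, gamma=0.99):
    """One online actor-critic update; returns the next observation, or None if the episode ended."""
    s = torch.as_tensor(obs, dtype=torch.float32)
    dist = torch.distributions.Categorical(logits=actor(s))
    a = dist.sample()                                   # 1. sample action from pi_theta
    obs_next, r, done, _ = env.step(a.item())           # observe r and s'
    s_next = torch.as_tensor(obs_next, dtype=torch.float32)

    with torch.no_grad():
        target = r + (0.0 if done else gamma * critic(s_next).item())
        advantage = target - critic(s).item()           # 3. A ~ r + gamma V_w(s') - V_w(s)

    critic_loss = (critic(s).squeeze() - target) ** 2   # 2. move V_w(s) towards the target
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -advantage * dist.log_prob(a)          # 4-5. policy gradient step
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return None if done else obs_next
```

For example, `actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))` and `critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))`, each with its own `torch.optim.Adam` optimizer, would fit this interface (again, an illustrative choice).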

21 Asynchronous advantage actor-critic (A3C)
[Diagram: Agent 1 … Agent n each gather their own experience and compute local updates, asynchronously updating the shared global parameters of $V$ and $\pi$.]
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

22 Asynchronous advantage actor-critic (A3C)
Mean and median human-normalized scores over 57 Atari games. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

23 Asynchronous advantage actor-critic (A3C)
TORCS car racing simulation (video). Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

24 Asynchronous advantage actor-critic (A3C)
Motor control tasks (video). Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

25 Benchmarks and environments for Deep RL
OpenAI Gym

