Proximal Policy Optimization

Proximal Policy Optimization is that we want improve training stability of the policy by limiting the change you make to the policy at each training epoch

2 reasons behind this:

Smaller policy changes = more likely to converge to optimal solution
Too big step in a policy update can result in falling “off the cliff” (getting a bad policy) and having a long time and even no possibility to recover

PPO = update policy conservatively, this is done by measuring how much policy changed to the former using a ratio calculation between current and former policy, clip this ratio in a range $[1 - ϵ, 1 + ϵ]$

Function made to avoid destructive large weight updates:

L^{C L I P} (θ) = \hat{E}_{t} [min (r_{t} (θ) \hat{A}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})]

Ratio function: $r_{t} (θ) = \frac{π _{θ} ( a _{t} ∣ s _{t} )}{π _{θ_{old} (a_{t} ∣ s_{t})}}$

if $r_{t} (θ) > 1$ the action $a_{t}$ at the state $s_{t}$ is in current policy than the old one
if $r_{t} (θ)$ is between 0 and 1, the action is less likely for current policy than for old one

Ayush Garg

Recently Updated

Hi, I am Ayush 1️⃣9️⃣

Flash Attention

Hyperic

Obsidian

Proximal Policy Optimization

Graph View

Backlinks