The idea behind Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change we make to the policy at each training epoch.

2 reasons behind this:

  • Smaller policy updates during training are more likely to converge to an optimal solution
  • Too big a step in a policy update can result in falling “off the cliff” (getting a bad policy) and taking a long time, or even having no possibility, to recover

PPO = update the policy conservatively. To do so, we measure how much the current policy changed compared to the former one using a ratio between the current and former policy, and we clip this ratio to a range [1 - ε, 1 + ε] so the update stays close to the old policy.
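For example, with the clip parameter ε = 0.2 (the value used in the original PPO paper), any ratio outside [0.8, 1.2] gets clipped back to that interval:

$$
\operatorname{clip}(1.5,\ 0.8,\ 1.2) = 1.2, \qquad \operatorname{clip}(0.9,\ 0.8,\ 1.2) = 0.9, \qquad \operatorname{clip}(0.6,\ 0.8,\ 1.2) = 0.8
$$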

The Clipped Surrogate Objective is the function PPO uses to avoid destructively large weight updates:
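It comes from the original PPO paper; here Â_t is the advantage estimate at timestep t and r_t(θ) is the probability ratio defined just below:

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]
$$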

Ratio function:
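The ratio compares the probability of taking action a_t in state s_t under the current policy with the probability of the same action under the old policy:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$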

  • if r_t(θ) > 1, the action at that state is more likely in the current policy than in the old one
  • if r_t(θ) is between 0 and 1, the action is less likely for the current policy than for the old one
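
As a concrete sketch (assuming PyTorch, and assuming we already have the log-probabilities of the taken actions under the old and current policies plus advantage estimates; the function and variable names are just for illustration), the ratio and the clipped surrogate loss can be computed like this:

```python
import torch


def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, negated so it can be minimized as a loss."""
    # r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), computed from log-probabilities.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Pessimistic (element-wise minimum) bound, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()


# Toy usage with made-up numbers for a batch of 3 transitions.
new_lp = torch.tensor([-0.9, -1.2, -0.3], requires_grad=True)
old_lp = torch.tensor([-1.0, -1.0, -1.0])
adv = torch.tensor([0.5, -0.2, 1.0])

loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()  # gradients w.r.t. new_lp stand in for policy gradients
print(loss.item())
```

Computing the ratio as exp(new_log_probs - old_log_probs) is the usual trick: it is numerically more stable than dividing raw probabilities, and policy networks typically expose log-probabilities directly.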