The idea behind Proximal Policy Optimization (PPO) is to improve the training stability of the policy by limiting the change made to the policy at each training epoch.
There are two reasons behind this:
- Smaller policy updates are more likely to converge to an optimal solution.
- Too big a step in a policy update can make the policy fall “off the cliff” (end up with a bad policy), from which it can take a long time, or even be impossible, to recover.
So with PPO, we update the policy conservatively. To do so, we measure how much the current policy changed compared to the former one using a ratio between the two policies, and we clip this ratio to a range [1 − ε, 1 + ε], so the current policy cannot move too far from the old one.
This gives the clipped surrogate objective, a function designed to avoid destructively large weight updates:
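As a reference, this objective is commonly written as follows (standard PPO notation, not spelled out in the original notes; ε is the clip range, Â_t the advantage estimate, and r_t(θ) the ratio defined just below):

```latex
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]
```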
Ratio function:
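In standard PPO notation (assumed here, the original notes do not write it out), the ratio is the probability of the action under the current policy divided by its probability under the old policy:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```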
- if the ratio is greater than 1, the action at that state is more likely in the current policy than in the old one
- if the ratio is between 0 and 1, the action is less likely for the current policy than for the old one
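A minimal PyTorch-style sketch of how the ratio and the clipping combine into a loss (the function name, arguments, and the 0.2 default are illustrative assumptions, not from the original notes):

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Ratio between current and former policy: exp(log pi_theta - log pi_theta_old).
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate objective.
    surr_unclipped = ratio * advantages
    # Clipped surrogate objective: keep the ratio inside [1 - eps, 1 + eps].
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two, negate because optimizers minimize.
    return -torch.min(surr_unclipped, surr_clipped).mean()
```

Taking the minimum of the clipped and unclipped terms is what removes the incentive for the policy to move far outside the clip range in a single update.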