Proximal Policy Optimization (PPO) is a policy gradient algorithm that improves training stability by limiting how much the policy can change during each update.

The core problem is that policy-gradient methods optimize from sampled rollouts. If the policy update is too large, the samples collected under the old policy may no longer describe the new policy well, and the update can destroy the behavior that produced the data.

PPO’s answer is: update the policy, but keep the new policy close to the policy that collected the data.

Intuition

PPO tries to avoid destructive policy updates.

  • Small policy changes are more likely to improve the policy smoothly.
  • Large policy changes can push the agent into a bad region where future rollouts are much worse.
  • Reusing rollout data only makes sense while the new policy is still close to the old one.

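Instead of directly constraining parameter distance, PPO constrains the probability ratio between the new policy and the old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

where:

  • $\pi_\theta$ is the policy currently being trained
  • $\pi_{\theta_{\text{old}}}$ is the policy that generated the rollout data
  • $r_t(\theta) > 1$ means the new policy assigns more probability to action $a_t$
  • $r_t(\theta) < 1$ means the new policy assigns less probability to action $a_t$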
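In code, the ratio is usually computed from log probabilities rather than raw probabilities, since the log probabilities are what the rollout stores. The following is a minimal sketch in plain Python; the function name `probability_ratios` and the example probabilities are illustrative assumptions, not part of any particular library.

```python
import math


def probability_ratios(new_log_probs, old_log_probs):
    """Return r_t = exp(log pi_new(a_t|s_t) - log pi_old(a_t|s_t)) per sample."""
    if len(new_log_probs) != len(old_log_probs):
        raise ValueError("log-probability lists must have the same length")
    return [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]


# Hypothetical probabilities of three sampled actions under each policy.
old_log_probs = [math.log(0.50), math.log(0.20), math.log(0.10)]
new_log_probs = [math.log(0.60), math.log(0.20), math.log(0.05)]

ratios = probability_ratios(new_log_probs, old_log_probs)
# First ratio is above 1 (more likely now), second equals 1, third is below 1.
print([round(r, 3) for r in ratios])
```

Subtracting log probabilities and exponentiating once avoids dividing two very small numbers, which is why this form is often preferred.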

Clipped surrogate objective

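PPO uses a clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right]$$

Here $r_t(\theta)$ is the probability ratio defined above and $\hat{A}_t$ is the advantage estimate at timestep $t$.

The clipping range is:

$$[1-\epsilon,\ 1+\epsilon]$$

A common value is $\epsilon = 0.2$, which means the objective stops rewarding further change once an action's probability is more than about 20% higher than under the old policy (for a positive advantage) or more than about 20% lower (for a negative advantage).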
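The clipped objective can be sketched directly from its definition. This is a plain-Python illustration under stated assumptions: the names `clip` and `clipped_surrogate` are hypothetical, the inputs are already-computed ratios and advantages, and a real implementation would use a tensor library so that gradients can flow through the ratio.

```python
def clip(x, low, high):
    """Clamp x into the closed interval [low, high]."""
    return max(low, min(high, x))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        # Taking the minimum makes the objective a pessimistic bound.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# Ratio 1.5 with a positive advantage is capped at 1.2 * A.
print(clipped_surrogate([1.5], [1.0]))
# Ratio 1.5 with a negative advantage is NOT capped: the penalty stays in full.
print(clipped_surrogate([1.5], [-1.0]))
```

The second call shows that the minimum only removes incentives; it never hides a change that makes the objective worse.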

How the clipping works

The advantage estimate $\hat{A}_t$ tells PPO whether an action was better or worse than expected. PPO commonly computes this using Generalized Advantage Estimation (GAE).

If $\hat{A}_t > 0$, the action was better than expected, so PPO wants to increase its probability. But once the ratio $r_t(\theta)$ is above $1+\epsilon$, the clipped objective stops giving extra reward for increasing it further.

If $\hat{A}_t < 0$, the action was worse than expected, so PPO wants to decrease its probability. But once the ratio $r_t(\theta)$ is below $1-\epsilon$, the clipped objective stops giving extra reward for decreasing it further.

So PPO still follows the policy gradient signal, but it removes the incentive to push the new policy too far away from the old policy on a single batch of data.
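The two cases can be written as a small predicate that reports when a sample no longer contributes a policy gradient. This is an illustrative sketch; `is_clipped` and `zero_gradient_fraction` are hypothetical names, and the fraction computed here is a diagnostic chosen for this example rather than a metric defined by any specific library.

```python
def is_clipped(ratio, advantage, epsilon=0.2):
    """True when the clipped branch is active, so the objective is flat in the ratio."""
    if advantage > 0:
        return ratio > 1.0 + epsilon
    if advantage < 0:
        return ratio < 1.0 - epsilon
    # A zero advantage gives a zero objective term either way.
    return False


def zero_gradient_fraction(ratios, advantages, epsilon=0.2):
    """Fraction of samples whose incentive to move further has been removed."""
    flags = [is_clipped(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(flags) / len(flags)


# Hypothetical batch: only the first two samples have moved too far
# in the direction their advantage favors.
ratios = [1.3, 0.7, 1.0, 1.3]
advantages = [1.0, -1.0, 1.0, -1.0]
print(zero_gradient_fraction(ratios, advantages))
```

Note the last sample: the ratio is above the range but the advantage is negative, so the objective still pushes the probability back down.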

Training loop

A typical PPO update looks like:

  1. Run the current policy in the environment to collect trajectories.
  2. Estimate returns and advantages, often with GAE.
  3. Store the old action log probabilities from the rollout policy.
  4. Train the policy for several epochs of minibatch updates on the same rollout data, using the clipped objective.
  5. Train a value function to predict returns.
  6. Repeat with fresh on-policy data.
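Step 2 of the loop above can be sketched as a backward pass over one rollout. This is a minimal plain-Python version of GAE under stated assumptions: the function name `compute_gae`, the argument layout, and the default discount `gamma` and smoothing parameter `lam` are illustrative choices, not values taken from the text.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    value estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # Do not bootstrap across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value-function targets: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical two-step episode that ends on the second step.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
    dones=[False, True], gamma=1.0, lam=1.0,
)
print(advantages, returns)
```

The returned `returns` list is what the value function in step 5 would be trained to predict in this sketch.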

Full objective

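In practice, PPO is usually optimized with a combined actor-critic objective, which is maximized:

$$L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where:

  • $L_t^{CLIP}$ updates the policy
  • $L_t^{VF}$ trains the value function
  • $S[\pi_\theta]$ is an entropy bonus that encourages exploration
  • $c_1$ and $c_2$ control the value-loss and entropy terms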
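The pieces of the combined objective can be put together as follows. This is a hedged sketch, not a reference implementation: it assumes a squared-error value loss and a categorical policy, the helper names are hypothetical, and the default coefficients `c1=0.5` and `c2=0.01` are placeholder values chosen for illustration. Because most optimizers minimize, the function returns the negated objective.

```python
import math


def value_loss(predicted_values, returns):
    """Mean squared error between value predictions and return targets."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, returns)]
    return sum(errors) / len(errors)


def categorical_entropy(probs):
    """Entropy of one discrete action distribution; zero-probability terms add 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_loss(policy_objective, v_loss, entropy, c1=0.5, c2=0.01):
    """Negated combined objective, so that minimizing it maximizes the objective."""
    return -(policy_objective - c1 * v_loss + c2 * entropy)


# Hypothetical numbers for a single minibatch.
v = value_loss([1.0, 2.0], [2.0, 2.0])   # mean of (1, 0)
h = categorical_entropy([0.5, 0.5])      # entropy of a fair two-way choice
print(ppo_loss(policy_objective=0.1, v_loss=v, entropy=h))
```

Raising `c2` in this sketch rewards higher-entropy policies more strongly, while raising `c1` gives the value-function error more weight relative to the policy term.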