Proximal Policy Optimization (PPO) is a policy gradient algorithm that improves training stability by limiting how much the policy can change during each update.
The core problem is that policy-gradient methods optimize from sampled rollouts. If the policy update is too large, the samples collected under the old policy may no longer describe the new policy well, and the update can destroy the behavior that produced the data.
PPO’s answer is: update the policy, but keep the new policy close to the policy that collected the data.
Intuition
PPO tries to avoid destructive policy updates.
- Small policy changes are more likely to improve the policy smoothly.
- Large policy changes can push the agent into a bad region where future rollouts are much worse.
- Reusing rollout data only makes sense while the new policy is still close to the old one.
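Instead of directly constraining parameter distance, PPO constrains the probability ratio between the new policy and the old policy:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
where:
- $\pi_\theta$ is the policy currently being trained
- $\pi_{\theta_{\text{old}}}$ is the policy that generated the rollout data
- $r_t(\theta) > 1$ means the new policy assigns more probability to action $a_t$
- $r_t(\theta) < 1$ means the new policy assigns less probability to action $a_t$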
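In practice the ratio is usually computed from log-probabilities rather than by dividing raw probabilities. The snippet below is a minimal sketch of that idea using only the standard library; the function name `probability_ratio` and the example probabilities are hypothetical and chosen purely for illustration.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """Return pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    # exp(log p_new - log p_old) avoids dividing two very small probabilities.
    return math.exp(new_log_prob - old_log_prob)


# Hypothetical example: the rollout policy gave the action probability 0.25,
# and the policy being trained now gives it probability 0.30.
old_lp = math.log(0.25)
new_lp = math.log(0.30)
ratio = probability_ratio(new_lp, old_lp)

# A ratio above 1 means the new policy favors this action more than before.
print(ratio > 1.0)
```

The old log-probability is recorded once at rollout time and treated as a constant, while the new log-probability is recomputed on every update step.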
Clipped surrogate objective
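PPO uses a clipped surrogate objective:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
Here $r_t(\theta)$ is the probability ratio between the new and old policy, and $\hat{A}_t$ is the advantage estimate at timestep $t$.
The clipping range is:
$$[1-\epsilon,\ 1+\epsilon]$$
A common value is $\epsilon = 0.2$.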
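As a rough illustration, the clipped surrogate can be evaluated on a handful of samples in plain Python. This is a sketch under stated assumptions, not a reference implementation: the helper names `clip` and `clipped_surrogate`, the sample values, and the default `epsilon=0.2` are all chosen for the example.

```python
def clip(x: float, low: float, high: float) -> float:
    """Clamp x into the interval [low, high]."""
    return max(low, min(x, high))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        # Taking the minimum makes the objective a pessimistic bound.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# Hypothetical batch: one unchanged sample, one ratio pushed too high on a
# good action, and one ratio pushed too low on a bad action.
ratios = [1.0, 1.5, 0.5]
advantages = [1.0, 1.0, -1.0]
objective = clipped_surrogate(ratios, advantages)
print(objective)
```

In this batch the second and third samples contribute their clipped values, so moving those ratios even further from 1 would not raise the objective.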
How the clipping works
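The advantage estimate $\hat{A}_t$ indicates whether action $a_t$ turned out better or worse than expected.
If $\hat{A}_t > 0$, the objective rewards raising the probability of $a_t$, but only until the ratio reaches $1+\epsilon$; beyond that point the clipped term is constant and the incentive vanishes.
If $\hat{A}_t < 0$, the objective rewards lowering the probability of $a_t$, but only until the ratio reaches $1-\epsilon$.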
So PPO still follows the policy gradient signal, but it removes the incentive to push the new policy too far away from the old policy on a single batch of data.
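One way to check this behavior is to measure how the per-sample objective responds to a small change in the ratio. The sketch below uses a finite-difference slope; the function names and the probe values are assumptions made for the illustration.

```python
def per_sample_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Clipped surrogate for a single (ratio, advantage) pair."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope(ratio: float, advantage: float, h: float = 1e-6) -> float:
    """Central finite-difference slope of the objective with respect to the ratio."""
    upper = per_sample_objective(ratio + h, advantage)
    lower = per_sample_objective(ratio - h, advantage)
    return (upper - lower) / (2 * h)


# Probe both signs of the advantage below, inside, and above the clip range.
for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        print(f"A={advantage:+.0f} r={ratio:.1f} slope={slope(ratio, advantage):+.1f}")
```

The slope is zero only when the ratio has already moved past the clip boundary in the direction the advantage favors. When the ratio has moved the wrong way (for example, a ratio of 0.5 on a positive advantage), the slope is nonzero, so the update can still correct it.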
Training loop
A typical PPO update looks like:
- Run the current policy in the environment to collect trajectories.
- Estimate returns and advantages, often with GAE.
- Store the old action log probabilities from the rollout policy.
- Train the policy for several epochs of minibatch updates on that rollout, using the clipped objective.
- Train a value function to predict returns.
- Repeat with fresh on-policy data.
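The advantage-estimation step above is often the least obvious part of the loop. The following is a minimal sketch of GAE for a single rollout, assuming plain Python lists as inputs; the function name `compute_gae`, its argument layout, and the defaults `gamma=0.99` and `lam=0.95` are illustrative assumptions rather than fixed parts of the algorithm.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    values[t] is the critic's estimate V(s_t), last_value is V(s_T) used for
    bootstrapping, and dones[t] is True if the episode ended after step t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = last_value
    # Work backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(T)):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value-function targets are the advantages plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical 3-step episode with an untrained critic that predicts zero.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    dones=[False, False, True],
    gamma=1.0,
    lam=1.0,
)
print(advantages, returns)
```

With `gamma=1.0` and `lam=1.0` the estimate reduces to the remaining reward minus the value baseline, which makes the example easy to verify by hand. Many implementations also normalize advantages per batch before the policy update.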
Full objective
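In practice, PPO is usually optimized with a combined actor-critic objective:
$$L(\theta) = \hat{\mathbb{E}}_t\left[L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$
where:
- $L_t^{\text{CLIP}}$ is the clipped surrogate objective, which updates the policy
- $L_t^{\text{VF}}$ is the value-function loss (typically a squared error against the return targets), which trains the value function
- $S[\pi_\theta](s_t)$ is an entropy bonus that encourages exploration
- $c_1$ and $c_2$ control the value-loss and entropy terms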
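To make the signs concrete, here is a small sketch that combines the three terms into a single scalar loss for a discrete action space. The function name `ppo_loss`, the argument layout, and the coefficients `c1=0.5` and `c2=0.01` are assumptions for illustration. The combined objective is maximized, so the sketch returns its negation as a loss to minimize.

```python
import math


def ppo_loss(ratios, advantages, values, returns, action_probs,
             epsilon=0.2, c1=0.5, c2=0.01):
    """Negative of (clipped policy objective - c1 * value loss + c2 * entropy)."""
    n = len(ratios)

    # Clipped policy objective, averaged over the batch.
    policy_objective = 0.0
    for r, a in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        policy_objective += min(r * a, clipped_ratio * a)
    policy_objective /= n

    # Squared-error value loss against the return targets.
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n

    # Mean entropy of the per-state discrete action distributions.
    entropy = sum(
        -sum(p * math.log(p) for p in probs if p > 0.0)
        for probs in action_probs
    ) / n

    objective = policy_objective - c1 * value_loss + c2 * entropy
    return -objective


# Hypothetical single-sample batch with a uniform two-action policy.
loss = ppo_loss(
    ratios=[1.0],
    advantages=[2.0],
    values=[1.0],
    returns=[2.0],
    action_probs=[[0.5, 0.5]],
)
print(loss)
```

Higher entropy lowers the loss, which is how the bonus discourages the policy from collapsing onto a single action too early.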