Advantage tells us how much better a given action is compared to the average action in a given state.
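In policy gradient methods, the advantage estimate $\hat{A}_t$ weights the policy update, and its sign indicates how the action compared with expectation: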
- positive advantage: action was better than expected
- negative advantage: action was worse than expected
Generalized Advantage Estimation (GAE) estimates the advantage by combining many temporal difference (TD) residuals. It trades off bias and variance using the parameter $\lambda \in [0, 1]$.
Two basic approaches can be used to compute the advantage, and GAE interpolates between them.
Monte Carlo (MC)
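The advantage is computed as the difference between the cumulative discounted future reward and the value function for the current state:
$$\hat{A}_t^{\text{MC}} = G_t - V(s_t), \qquad G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$$
where:
- $G_t$: cumulative discounted future reward (the return) observed from timestep $t$ to the end of a trajectory of length $T$
- $V(s_t)$: value function predicted by the critic for the current state
- $\gamma$: discount factor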
MC estimates have low bias because they use the actual observed return from the trajectory. However, they have high variance, so they often need many samples to produce an accurate advantage estimate.
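The Monte Carlo estimate can be sketched in a few lines of plain Python. This is a minimal illustration for a single finished episode; the function names `discounted_returns` and `mc_advantages` and the reward and value numbers are made up for the example.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for every timestep of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the next one: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(rewards, values, gamma):
    """Monte Carlo advantage: observed return minus the critic's value for s_t."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 0.0, 2.0]  # made-up rewards for a 3-step episode
values = [1.5, 1.0, 1.0]   # made-up critic predictions V(s_t)
mc_adv = mc_advantages(rewards, values, gamma=0.5)
print(mc_adv)  # [0.0, 0.0, 1.0]
```

Note that nothing can be computed until the episode has ended, because every $G_t$ depends on all later rewards.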
Temporal Difference (TD)
Core idea: instead of waiting until the end of a full trajectory to know how good an action was, estimate it using the immediate observed reward and the critic’s estimate of future value.
The temporal difference residual uses value predictions from the critic to form a one-step estimate of the advantage:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
where:
- $\delta_t$: TD residual at timestep $t$
- $r_t$: observed reward at timestep $t$
- $\gamma$: discount factor
- $V(s_t)$: value function predicted by the critic
- $V(s_{t+1})$: critic’s estimate of future value
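If $\delta_t > 0$, the step turned out better than the critic predicted; if $\delta_t < 0$, it turned out worse. The estimate has low variance because it depends on a single sampled reward, but it is biased whenever the critic is inaccurate.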
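As a rough sketch, the TD residuals for a whole trajectory can be computed in one pass. The helper name `td_residuals` and the `last_value` argument are assumptions for this example: `last_value` stands for the critic's value of the state after the final step, and the default of 0.0 assumes the episode terminated there.

```python
def td_residuals(rewards, values, gamma, last_value=0.0):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each timestep.

    last_value is V of the state after the final step; 0.0 assumes termination.
    """
    next_values = list(values[1:]) + [last_value]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


rewards = [1.0, 0.0, 2.0]  # made-up rewards
values = [1.5, 1.0, 1.0]   # made-up critic predictions V(s_t)
deltas = td_residuals(rewards, values, gamma=0.5)
print(deltas)  # [0.0, -0.5, 1.0]
```

Unlike the Monte Carlo estimate, each residual needs only one reward and two critic predictions, so it is available as soon as the next state is observed.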
N-step estimators
One-step TD only looks one step into the future. We can generalize this idea to capture any number of steps.
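The $n$-step advantage estimate sums $n$ observed rewards and then bootstraps from the critic:
$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$$
Examples:
- $n = 1$: $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t$
- $n = 2$: $\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
- $n \to \infty$: $\hat{A}_t^{(\infty)} = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - V(s_t)$, the Monte Carlo estimate
Larger $n$ relies more on observed rewards and less on the critic, which lowers bias but raises variance.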
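A small sketch of the $n$-step estimator, assuming a single trajectory stored as Python lists. The function name `n_step_advantage` is hypothetical, and truncating at the end of the trajectory (bootstrapping from `last_value`, 0.0 for a terminal state) is one possible convention rather than the only one.

```python
def n_step_advantage(rewards, values, gamma, t, n, last_value=0.0):
    """n-step advantage at timestep t, truncated at the end of the trajectory."""
    T = len(rewards)
    end = min(t + n, T)  # index of the state we bootstrap from
    total = 0.0
    for k in range(end - t):
        total += (gamma ** k) * rewards[t + k]
    # Bootstrap from the critic, or from last_value if we ran off the trajectory.
    bootstrap = values[end] if end < T else last_value
    return total + (gamma ** (end - t)) * bootstrap - values[t]


rewards = [1.0, 0.0, 2.0]  # made-up rewards
values = [1.5, 1.0, 1.0]   # made-up critic predictions V(s_t)
for n in (1, 2, 3):
    print(n, n_step_advantage(rewards, values, gamma=0.5, t=0, n=n))
```

With these numbers, $n = 1$ reproduces the one-step TD residual at $t = 0$, and any $n \geq 3$ covers the whole episode and reproduces the Monte Carlo advantage.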
GAE Formula
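GAE combines all multi-step advantage estimates using exponentially decaying weights controlled by $\lambda$:
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}$$
where $\hat{A}_t^{(n)}$ is the $n$-step advantage estimate.
The common TD-residual form is:
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$
For a finite trajectory of length $T$, this becomes:
$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$$
So GAE is a discounted sum of future TD errors. Recent TD errors matter more, and farther TD errors matter less depending on $\gamma \lambda$.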
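The finite-trajectory sum is usually computed backwards with the recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$. The sketch below assumes one trajectory with no episode boundary in the middle; the name `gae_advantages` and the `last_value` argument (the critic's value after the final step, 0.0 if terminal) are choices made for this example.

```python
def gae_advantages(rewards, values, gamma, lam, last_value=0.0):
    """GAE for one trajectory via A_t = delta_t + gamma * lam * A_{t+1}."""
    T = len(rewards)
    advantages = [0.0] * T
    next_value = last_value  # V(s_{t+1}) for the step being processed
    running = 0.0            # A_{t+1}, zero past the end of the trajectory
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 0.0, 2.0]  # made-up rewards
values = [1.5, 1.0, 1.0]   # made-up critic predictions V(s_t)
adv = gae_advantages(rewards, values, gamma=0.5, lam=0.5)
print(adv)  # [-0.0625, -0.25, 1.0]
```

The backward pass costs one multiply-add per timestep, whereas evaluating the sum separately for every $t$ would be quadratic in the trajectory length. Rollouts that contain several episodes would additionally need to reset the running sum at each episode boundary, which this sketch omits.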
Bias-Variance Tradeoff
- $\lambda = 0$: pure one-step TD, lower variance, higher bias
- $\lambda = 1$: close to Monte Carlo returns, lower bias, higher variance
- $0 < \lambda < 1$: balances bias and variance
Intuition:
- smaller $\lambda$ trusts the critic more
- larger $\lambda$ trusts the sampled trajectory more
In practice, PPO commonly uses a high value like $\lambda = 0.95$.
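One way to see the tradeoff is to look at the weight $(\gamma \lambda)^l$ that GAE places on the TD error $l$ steps ahead. The helper `gae_weights` below is a hypothetical name, and $\gamma = 0.5$ is chosen only to keep the numbers easy to read.

```python
def gae_weights(gamma, lam, horizon):
    """Weight (gamma * lam) ** l that GAE places on the TD error l steps ahead."""
    return [(gamma * lam) ** l for l in range(horizon)]


for lam in (0.0, 0.5, 1.0):
    print(lam, gae_weights(gamma=0.5, lam=lam, horizon=4))
```

With $\lambda = 0$ only the immediate TD error receives any weight, so the estimate leans entirely on the critic. With $\lambda = 1$ the weights decay by $\gamma$ alone, so every later reward in the trajectory contributes.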
Use in PPO
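PPO uses GAE to compute $\hat{A}_t$, which then weights the clipped surrogate objective:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\rho_t(\theta) \hat{A}_t,\ \text{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right]$$
where:
- $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$: probability ratio between the new and old policies (written $\rho_t$ here to avoid confusion with the reward $r_t$)
- $\hat{A}_t$: GAE advantage estimate at timestep $t$
- $\epsilon$: clipping range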
GAE handles the advantage estimation problem, while PPO handles the policy update stability problem.
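To make the split concrete, here is a minimal sketch of how precomputed advantages could enter PPO's clipped surrogate objective. It assumes the standard clipped form of the objective, uses plain Python rather than an autodiff framework, and the function name `ppo_clip_objective`, the log-probability inputs, and `clip_eps=0.2` are illustrative choices for the example.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the min removes any gain from pushing the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Made-up action probabilities under the new and old policies.
new_logp = [math.log(0.6), math.log(0.1)]
old_logp = [math.log(0.4), math.log(0.2)]
advantages = [1.0, -1.0]  # e.g. produced by a GAE routine
objective = ppo_clip_objective(new_logp, old_logp, advantages)
print(round(objective, 6))  # 0.2
```

In the first sample the ratio of 1.5 is clipped to 1.2 because the advantage is positive. In the second the ratio of 0.5 is clipped to 0.8 because the advantage is negative. In both cases the objective stops rewarding further movement away from the old policy.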