Advantage tells us how much better a given action is compared to the average action in a given state.

In policy gradient methods, the advantage estimate controls whether the probability of an action should increase or decrease:

  • positive advantage: action was better than expected
  • negative advantage: action was worse than expected

Generalized Advantage Estimation (GAE) estimates the advantage by combining many temporal difference (TD) residuals. It trades off bias and variance using the parameter $\lambda$.

There are two basic approaches to computing the advantage, and GAE interpolates between them.

Monte Carlo (MC)

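The advantage is computed as the difference between the cumulative (discounted) future reward and the value the critic predicts for the current state:

$$\hat{A}_t^{\mathrm{MC}} = G_t - V(s_t)$$

where:

  • $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$ - discounted return observed from timestep $t$ to the end of the trajectory
  • $V(s_t)$ - value of the current state predicted by the critic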

MC estimates have low bias because they use the actual observed return from the trajectory. However, they have high variance, so they often need many samples to produce an accurate advantage estimate.
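As a rough illustration of the Monte Carlo estimate, the sketch below computes the discounted return at every timestep and subtracts the critic's prediction. The function names, the reward and value numbers, and the choice of `gamma=0.5` are all made up for the example; it assumes a single episode that runs to termination.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every timestep t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_advantages(rewards, values, gamma):
    """Monte Carlo advantage: observed return minus the critic's V(s_t)."""
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)

rewards = [1.0, 0.0, 2.0]   # hypothetical rewards for a 3-step episode
values = [1.5, 1.0, 1.0]    # hypothetical critic predictions V(s_0), V(s_1), V(s_2)
adv = mc_advantages(rewards, values, gamma=0.5)
print(adv)
```

Every return depends on all later rewards in the episode, which is where the high variance comes from.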

Temporal Difference (TD)

Core idea: instead of waiting until the end of a full trajectory to know how good an action was, estimate it using the immediate observed reward and the critic’s estimate of future value.

The temporal difference residual uses value predictions from the critic to form a one-step estimate of the advantage:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

  • $\delta_t$ - TD residual at timestep $t$
  • $r_t$ - observed reward at timestep $t$
  • $\gamma$ - discount factor
  • $V(s_t)$ - value of the current state predicted by the critic
  • $V(s_{t+1})$ - critic’s estimate of future value

If $\delta_t > 0$, the outcome was better than the critic expected. If $\delta_t < 0$, it was worse than expected.
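The one-step residual can be sketched in a few lines of NumPy. This is a minimal example rather than a reference implementation: the function name and the numbers are hypothetical, and it assumes `values` carries one extra entry for the state reached after the last step (0.0 here, treating the episode as terminated).

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last entry is the critic's
    value for the state after the final step (0.0 if the episode ended).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards
values = [1.5, 1.0, 1.0, 0.0]    # hypothetical V(s_0..s_3), terminal value 0
deltas = td_residuals(rewards, values, gamma=0.5)
print(deltas)
```

A positive entry means that step turned out better than the critic predicted, and a negative entry means it turned out worse.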

N-step estimators

One-step TD only looks one step into the future. We can generalize this idea to capture any number of steps.

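The $n$-step advantage estimate is:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$$

Examples:

  • $n = 1$: $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$, which is the one-step TD residual
  • $n = 2$: $\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$

Larger $n$ uses more real rewards before bootstrapping from the critic. This usually reduces bias but increases variance.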
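To make the n-step idea concrete, here is a small sketch that sums up to `n` observed rewards and then bootstraps from the critic. The helper name and the data are hypothetical. Truncating at the end of the episode when fewer than `n` steps remain is one reasonable convention, not the only one.

```python
def n_step_advantage(rewards, values, gamma, t, n):
    """n-step advantage at timestep t.

    Uses up to n real rewards, then bootstraps from the critic.
    `values` has len(rewards) + 1 entries; the last one is the value of
    the state after the final step (0.0 for a terminated episode).
    """
    end = min(t + n, len(rewards))   # stop at the episode end if n is too large
    total = 0.0
    for k in range(t, end):
        total += gamma ** (k - t) * rewards[k]
    total += gamma ** (end - t) * values[end]   # bootstrap from the critic
    return total - values[t]

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards
values = [1.5, 1.0, 1.0, 0.0]    # hypothetical critic values, terminal value 0
one_step = n_step_advantage(rewards, values, gamma=0.5, t=0, n=1)
two_step = n_step_advantage(rewards, values, gamma=0.5, t=0, n=2)
full = n_step_advantage(rewards, values, gamma=0.5, t=0, n=3)
print(one_step, two_step, full)
```

With `n=3` the estimate uses every reward in this short episode, so it coincides with the Monte Carlo advantage at `t=0`.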

GAE Formula

GAE combines all multi-step advantage estimates using exponentially decaying weights controlled by $\lambda$.

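The common TD-residual form is:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

where $\delta_{t+l}$ is the TD residual $l$ steps ahead. For a finite trajectory of length $T$, this becomes:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \, \delta_{t+l}$$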

So GAE is a discounted sum of future TD errors. Near-term TD errors matter more, and more distant ones are down-weighted geometrically at a rate set by $\lambda$ (together with $\gamma$).
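In code, the discounted sum of TD errors is usually accumulated backwards through the trajectory, since each advantage equals the current TD error plus a decayed copy of the next advantage. The sketch below is a minimal version under simplifying assumptions: one trajectory, no episode boundary in the middle (so no `done` mask), and a `values` array with one extra bootstrap entry. The name `compute_gae` and the numbers are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, gamma, lam):
    """GAE via backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap
    value for the state after the final step (0.0 if the episode ended).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0   # holds A_{t+1}; zero past the end of the trajectory
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards
values = [1.5, 1.0, 1.0, 0.0]    # hypothetical critic values, terminal value 0
adv = compute_gae(rewards, values, gamma=0.5, lam=0.5)
print(adv)
```

The backward pass costs one multiply-add per timestep, whereas evaluating the sum separately for every `t` would be quadratic in the trajectory length.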

Bias-Variance Tradeoff

$\lambda$ controls how much GAE behaves like one-step TD versus Monte Carlo estimation:

  • $\lambda = 0$: pure one-step TD, lower variance, higher bias
  • $\lambda = 1$: close to Monte Carlo returns (identical when the trajectory runs to termination), lower bias, higher variance
  • $0 < \lambda < 1$: balances bias and variance

Intuition:

  • smaller $\lambda$ trusts the critic more
  • larger $\lambda$ trusts the sampled trajectory more
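The two limiting cases can be checked numerically. The sketch below evaluates the GAE sum directly (not with the usual backward recursion) on made-up data. At one extreme it reproduces the one-step TD residuals, and at the other the Monte Carlo advantages. The function name and all numbers are hypothetical, and the episode is assumed to terminate after the last step.

```python
import numpy as np

def gae_direct(deltas, gamma, lam):
    """Evaluate sum_l (gamma * lam)^l * delta_{t+l} for every timestep t."""
    T = len(deltas)
    return np.array([
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ])

gamma = 0.5
rewards = np.array([1.0, 0.0, 2.0])        # hypothetical rewards
values = np.array([1.5, 1.0, 1.0, 0.0])    # hypothetical critic values, terminal value 0
deltas = rewards + gamma * values[1:] - values[:-1]

td_like = gae_direct(deltas, gamma, lam=0.0)   # only delta_t survives
mc_like = gae_direct(deltas, gamma, lam=1.0)   # TD errors telescope into G_t - V(s_t)

# Monte Carlo advantage computed independently for comparison.
returns = np.array([1.0 + 0.5 * 0.0 + 0.25 * 2.0, 0.0 + 0.5 * 2.0, 2.0])
mc_advantage = returns - values[:-1]
print(td_like, mc_like, mc_advantage)
```

Intermediate settings land between these two vectors, which is the sense in which the parameter interpolates between trusting the critic and trusting the sampled rewards.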

In practice, PPO commonly uses a high value like $\lambda = 0.95$ because it keeps variance manageable while still using information from multiple future rewards.

Use in PPO

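PPO uses GAE to compute $\hat{A}_t$, then plugs that advantage estimate into the clipped policy objective:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where:

  • $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ - probability ratio between the new and old policies (not to be confused with the reward $r_t$)
  • $\epsilon$ - clipping range
  • $\hat{A}_t$ - advantage estimate from GAE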

GAE handles the advantage estimation problem, while PPO handles the policy update stability problem.
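As a rough sketch of how the advantage enters the clipped objective, the NumPy function below takes log-probabilities under the new and old policies plus precomputed advantages. It only evaluates the objective; a real PPO implementation would compute it in an autodiff framework and maximize it. The function name, the sample probabilities, and the default `clip_eps=0.2` are assumptions for illustration.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum keeps the more pessimistic estimate.
    return np.mean(np.minimum(unclipped, clipped))

old_log_probs = np.log([0.5, 0.5])      # hypothetical old-policy probabilities
new_log_probs = np.log([0.75, 0.25])    # hypothetical new-policy probabilities
advantages = [1.0, -1.0]                # e.g. produced by a GAE routine
objective = ppo_clip_objective(new_log_probs, old_log_probs, advantages)
print(objective)
```

In this example the first action's ratio (1.5) is capped at 1.2 because its advantage is positive, and the second action's ratio (0.5) is capped at 0.8 because its advantage is negative. Neither sample rewards the policy for moving further than the clip range allows.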