Blog Link: https://cameronrwolfe.substack.com/p/grpo

Group Relative Policy Optimization (GRPO) is heavily based on Proximal Policy Optimization (PPO)

Policy update:

Advantage:
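The formula that belonged under this heading is not preserved in these notes. As a minimal sketch, assuming the heading refers to the group-relative advantage usually associated with GRPO, each completion's reward is normalized against the mean and standard deviation of the rewards in its own group. The function name, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not taken from the blog.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for several completions of the same prompt.
    eps: small constant (an assumption) to avoid division by zero when
         every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average get a positive advantage and those below get a negative one, so no separate value model is needed to form a baseline.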

PPO Loss:
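The equation for this heading did not survive in the notes. For reference, the standard clipped surrogate objective of PPO is written out below. The notation ($r_t$, $\hat{A}_t$, $\epsilon$) follows the common convention and is an assumption about what the original showed.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```

Here $\pi_{\theta_{\text{old}}}$ is the policy before the current update, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range. This objective is maximized, so the loss that is minimized in practice is its negative.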

GRPO Loss:

  • The reference policy refers to the model as it was before the fine-tuning stage started
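The GRPO equation itself is missing from these notes. A sketch of the objective in its commonly cited form is given below; the symbols ($G$, $o_i$, $\beta$, $\pi_{\text{ref}}$) are the usual ones and are assumed rather than copied from the blog.

```latex
J_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Big\{
    \min\!\Big( r_{i,t}(\theta)\,\hat{A}_{i,t},\;
    \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big)
    - \beta\, D_{\text{KL}}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]
  \Big\}
\right]
```

Here $G$ is the number of completions sampled per prompt, $o_i$ is the $i$-th completion, $r_{i,t}(\theta)$ is the same policy ratio used in PPO, $\hat{A}_{i,t}$ is the group-relative advantage, and $\pi_{\text{ref}}$ is the reference policy. The clipped term matches PPO; the KL term is subtracted directly inside the objective with weight $\beta$.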

KL Divergence

When training models with PPO, we incorporate the KL divergence between the current policy and the reference policy. The divergence serves as a penalty that encourages the current policy to stay similar to the reference policy.

KL divergence is computed by comparing the token distributions from the two LLMs at each token position in a sequence
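The per-token comparison can be sketched as follows. This is a minimal illustration, assuming each model's output at every position is already available as a probability distribution over the same vocabulary; the function names and the toy two-token vocabulary are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def per_token_kl(policy_dists, ref_dists):
    """One KL value per sequence position (hypothetical helper)."""
    return [kl_divergence(p, q) for p, q in zip(policy_dists, ref_dists)]

# Toy example: a 2-token sequence over a 2-word vocabulary.
policy_dists = [[0.5, 0.5], [0.9, 0.1]]  # current policy
ref_dists = [[0.5, 0.5], [0.5, 0.5]]     # reference policy
kl_values = per_token_kl(policy_dists, ref_dists)
```

The first position gives a divergence of zero because both models agree exactly, while the second is positive because the policy has moved away from the reference.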

The easiest way to approximate the divergence is to take the difference in log probabilities between the policy and the reference
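A minimal sketch of this approximation, assuming the log probability that each model assigns to every sampled token is already available. The function name and the example probabilities are made up for illustration.

```python
import math

def approx_kl_per_token(policy_logprobs, ref_logprobs):
    """Per-token estimate: log pi(token) - log pi_ref(token) (hypothetical helper)."""
    return [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]

# Log probabilities of two sampled tokens under each model (made-up values).
policy_logprobs = [math.log(0.6), math.log(0.3)]
ref_logprobs = [math.log(0.5), math.log(0.4)]

per_token = approx_kl_per_token(policy_logprobs, ref_logprobs)
kl_estimate = sum(per_token) / len(per_token)  # average over the sequence
```

An individual per-token value can be negative, as it is for the second token here. Only the expectation over tokens sampled from the policy equals the true KL divergence, so this estimate is cheap but noisy.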