Blog Link: https://cameronrwolfe.substack.com/p/grpo
Group Relative Policy Optimization (GRPO) is heavily based on Proximal Policy Optimization (PPO)
Policy update:
Advantage:
PPO Loss:
GRPO Loss:
- π_ref (the reference policy) refers to the model before we started the fine-tuning stage altogether
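For reference, the standard forms of these objectives from the PPO and GRPO literature are sketched below. The notation (r_t, ε, β, G, o_i, R_i) is an assumption and may not match the blog's exactly.

```latex
% Probability ratio between the current policy and the policy that generated the data
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% PPO clipped surrogate objective (maximized)
L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]

% GRPO group-relative advantage for completion i out of G sampled for the same prompt
A_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}

% GRPO objective: clipped surrogate averaged over tokens and completions, minus a KL penalty
J^{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(r_{i,t}(\theta) A_i,\ \operatorname{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon) A_i\right) - \beta\, D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right)\right]
```

The main structural difference is that GRPO replaces PPO's learned value-function advantage with a reward normalized within the group of completions sampled for the same prompt.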
KL Divergence
When training models with PPO, we incorporate the KL divergence between the current and reference policies. The divergence serves as a penalty that encourages the current policy to stay similar to the reference policy
KL divergence is computed by comparing the token distributions from the two LLMs (current policy and reference) at each token position in a sequence
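A minimal sketch of the exact per-token computation, assuming we have the raw logits over the vocabulary from both models at every position. The function names and the toy logits are hypothetical and use a 3-token vocabulary for readability.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(policy_logits, ref_logits):
    """Return KL(policy || reference) at each position.

    Both inputs are lists of length seq_len, each entry a list of vocab_size logits.
    """
    kls = []
    for p_row, q_row in zip(policy_logits, ref_logits):
        logp = log_softmax(p_row)
        logq = log_softmax(q_row)
        # Sum over the vocabulary of p(x) * (log p(x) - log q(x)).
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq)))
    return kls

# Toy example: two positions; the models agree exactly at the second one.
policy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
ref_logits = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
kl = per_token_kl(policy_logits, ref_logits)
```

The divergence is positive where the two distributions differ and zero where they match. Computing it this way requires the full vocabulary distribution from both models at every position.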
The easiest way to approximate the KL divergence is to take the difference between the log probabilities that the policy and the reference assign to each token
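A minimal sketch of this approximation, assuming the tokens were sampled from the current policy and that we already have each model's log probability for those tokens. The function name and the numbers are hypothetical.

```python
def approx_kl_per_token(policy_logprobs, ref_logprobs):
    """Per-token estimate: log pi(token) - log pi_ref(token).

    Both inputs hold log probabilities of the same sampled tokens, one per position.
    """
    return [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]

# Log probabilities of three sampled tokens under each model (made-up values).
policy_logprobs = [-0.20, -1.10, -0.05]
ref_logprobs = [-0.35, -0.90, -0.05]

per_token = approx_kl_per_token(policy_logprobs, ref_logprobs)
# Average over the sequence to get a single penalty value.
sequence_estimate = sum(per_token) / len(per_token)
```

Individual terms can be negative, as the second token shows here. Only the expectation over tokens sampled from the policy equals the true KL divergence, so this is a noisy estimate that is cheap because it needs one log probability per token from each model.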